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ABSTRACT 


We  consider  the  problem  of  efficiently  multiplying  matrices  on  distributed- 
memory,  message-passing  computers.  Block  decomposition  and  wavefront  comput- 
ing are  employed  to  yield  a communication-efficient  solution.  We  also  explore  in- 
terconnections between  distributed  data  structures,  physical  networks  of  nodes,  and 
virtual  networks  of  nodes  and  processes.  The  notion  of  wavefront  computing  is  ex- 
tended to  include  pipelining  of  data  between  nodes  of  networks  of  processes  as  well 
as  physical  networks  of  nodes.  Algorithms  are  developed  in  layers  to  facilitate  port- 
ing between  topologies  and  programming  environments;  we  also  show  how 
different  topologies  can  be  employed  in  a single  application.  A mathematical  char- 
acterization of  data-routing  for  efficient  matrix  multiplication  on  distributed- 
memory  machines  is  developed,  which  exhibits  a wavefront  version  of  the 
Dekel/Nassimi/Sahni  algorithm  as  a special  case.  We  also  discuss  the  notion  of 
granularity  in  this  context  and  use  it  to  distinguish  between  algorithms  on  grounds 
of  communication  complexity. 
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1.  Introduction. 


We  consider  here  the  problem  of  efficiently  computing  C = A * B for  M x M matrices  A and  B on  a mul- 
tiprocesssor  with  no  shared  memory.  We  assume  there  are  s2  nodes  available,  each  with  processor  and  private 
memory.  Communication  between  nodes  is  by  packet-switched  message-passing,  via  direct  links  between  nodes. 
Nodes  operate  asynchronously.  In  the  algorithms  we  develop  all  nodes  will  execute  the  same  program,  with 
masking  of  instructions.  Hence  the  full  power  of  MIMD  computing  is  not  used,  although  typically  it  is  available 
in  such  machines.  An  important  example  of  such  machines  is  the  hypercube  [21]. 

The  application  we  consider,  matrix  multiplication,  serves  as  a focal  point  for  a treatment  of  a number  of 
topics  related  to  message-passing  environments,  including: 

1.  Minimization  of  communication  cost. 

2.  Distributed  data  structures. 

3.  Wavefront  computing. 

4.  Networks  of  processes. 

5.  Topology  of  networks. 

6.  Topology  of  algorithms. 

7.  Changing  topologies. 

8.  Modular  software. 

9.  Characterization  of  relevant  algorithms. 

Minimization  of  communication  cost  is  critical  for  fine-grained  computing  in  particular;  here  this  occurs  for 
some  algorithms  when  M/s  is  fairly  small,  and  when  M/s 2 is  small  for  others.  We  will  return  to  this  in  Section 
13,  where  we  also  observe  the  effects  of  different  architectures  such  as  the  Mark  II  [24]  and  the  Intel  iPSC  [5]. 

Distributed  data  structures  play  an  important  role  in  minimizing  communication,  as  noted  in  [11],  They  are 
discussed  in  the  context  of  hypercubes  in,  e.g.,  [12],  [14],  [20].  In  [20]  a scheme  is  developed  to  permit 
distributed-memory  machines  to  simulate  shared  memory.  In  Section  7 we  note  that  this  would  simplify  some 
of  the  discussion  here,  although  we  do  not  incorporate  it.  In  [12]  the  term  areal  decomposition  is  used  for  the 
block  scheme  we  employ  here.  In  [3]  this  is  shown  to  be  optimal  for  the  algorithm  given  there  for  hypercubes. 
Block  decomposition  is  also  readily  adapted  to  the  Dekel/Nassimi/Sahni  algorithm  [1]  which  we  adopt  here. 
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Our  implementation  of  matrix  multiplication  is  via  wavefront  computing  ([9],  [10]).  This  involves  asynchro- 
nous pipelining  of  data  through  a processor  array.  It  employs  the  notion  of  dataflow  to  modify  the  systolic  ap- 
proach ([7], [8]).  Thus  computation  at  a node  is  triggered  by  receipt  of  operands;  there  are  no  synchronization 
points.  The  usage  of  wavefront  computing  in  the  context  of  hypercubes  has  been  noted  previously  (e.g.  [4]). 

We  also  extend  the  notion  of  wavefront  computing  by  permitting  data  to  be  pipelined  not  only  through  the 
physical  network  of  nodes,  but  also  through  a virtual  network  of  processes.  We  assume  that  matrix  multiplica- 
tion is  only  one  node  in  the  latter.  This  second  level  of  pipelining,  also  asynchronous,  may  further  reduce  com- 
munication delay.  In  particular,  if  processes  represent  steps  in  an  iterative  algorithm,  data  may  flow  continuously 
through  the  entire  algorithm,  avoiding  possibly  costly  synchronization  at  the  end  of  steps. 

We  regard  the  s2  nodes  as  forming  an  s x s array,  with  rows  and  columns  numbered  0,  ...  ,s-l.  Assume  that 
each  of  A,B,C  is  an  array[0..M-l,0..M-l]  of  some  unspecified  element  type.  Also  assume  for  simplicity  that  M 
= m * s for  an  integer  m.  We  partition  A,B,C  into  s x s arrays  of  blocks  { A(i J)) , (B(ij)},  [C(i j)) , where  a 
block  is  an  array[0..m-l,0..m-l].  In  expressions  such  as  A(ij),  parentheses  are  used  for  referencing  positions  in 
the  s x s array  of  blocks  forming  A,B,C,  or  in  the  s x s array  of  nodes.  Brackets  are  used  to  reference  elements 
of  data  arrays.  The  block  A(i  j)  is  defined  by 

A(i  j)[k,n]  = A[m*i+k,m*j+n]  (0  < k < m,  0 < n < m,  0 < i < s,  0 < j < s) 

and  similarly  for  B(i  j)  and  C(i  j).  The  awkward  notation  on  the  left  above  and  the  overloading  of  the  symbols 
A,B,C  will  cause  no  problem  since  we  will  never  again  refer  to  individual  elements  of  blocks. 

The  agents  for  the  distribution  of  { A(i j)}  and  (B(ij)}  will  be  assumed  to  be  previously  loaded  processes; 
i.e.  for  each  i and  j,  on  the  node  in  row  i,  column  j of  the  node  array  we  assume  a process  Asource(ij)  outputs 
A(ij),  and  a process  Bsource(ij)  outputs  B(ij).  We  will  construct  process  Multiply(ij),  to  be  loaded  to  the  same 
node,  which  will  receive  A(i j)  and  B(iJ)  and  output  C(i j).  Presumably  the  latter  becomes  input  for  yet  another 
process  loaded  to  the  same  node,  namely  Csink(i  j).  This  situation  might  arise,  for  example,  in  an  iterative  algo- 
rithm calling  matrix  multiplication  as  a subroutine.  The  initial  configuration  of  the  {(A(ij)}  and  { (B(i j)} , and 
the  final  configuration  of  the  {C(i j) } , are  an  important  part  of  Multiply;  in  fact  an  appropriate  way  of  viewing 
such  an  algorithm  is  via  state  transitions  of  such  configurations. 
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Collectively  we  refer  to  {Asource(ij)}  as  process  aiTay  Asource,  and  similarly  for  {Bsource(ij)}, 
(Multiply(ij)}  and  (Csink(ij)).  Presumably  Asource,  Bsource,  Multiply  and  Csink  form  part  of  a network  of 
process  arrays  hosted  by  the  underlying  s x s node  array.  The  latter  is  a virtual  entity  formed  from  the  underly- 
ing physical  network  of  nodes,  which  might  be  a hypercube,  toroidal  mesh  etc.  The  structure  of  the  network  of 
process  arrays  is  fixed  by  the  application,  and  we  assume  the  hardware  is  not  reconfigurable.  However,  we  can 
alter  the  configuration  of  nodes  in  the  virtual  s x s node  array  at  will.  Thus  each  process  array  can  be  hosted  by 
a private  node  array,  whose  structure  presumably  reflects  the  topology  of  the  process  array  in  some  sense.  Thus 
process  arrays  can  execute  on  virtual  machines. 

The  { A(i j)}  will  flow  through  rows  and  the  {B(i J)}  through  columns  of  the  node  array,  with  computation 
on  a node  triggered  by  the  receipt  of  operands.  The  process  arrays  Asource,  Bsource  and  Csink  may  themselves 
be  wavefront-based;  then  data  can  flow  asynchronously  from  Asource  and  Bsource  to  Multiply,  and  from  the 
latter  to  Csink.  Thus  the  flow  of  data  may  be  viewed  as  a three-dimensional  wavefront  through  the  network  of 
process  arrays  and  the  rows  and  columns  of  the  node  array,  as  illustrated  for  three  nodes  in  Fig.  1. 

Measured  in  terms  of  the  number  of  transfers  of  m x m blocks  between  neighboring  nodes,  the  wavefront 
approach  gives  a communication  cost  of  O(s)  for  any  physical  network  of  nodes  with  a toroidal  mesh  as  a sub- 
network, assuming  an  initial  configuration  of  data  as  described  above.  This  cost  estimate  is  essentially  indepen- 
dent of  the  hardware.  In  contrast,  the  algorithm  in  [3]  for  hypercubes,  which  is  synchronous  and  broadcast-based, 
may  accrue  a cost  of  up  to  0(s  * log  s)  depending  on  packet  size.  Even  if  the  hardware  is  augmented  with  sup- 
port for  broadcasting  through  rows  or  columns  of  the  virtual  node  array,  e.g.  by  broadcasting  to  subcubes  of  a 
cube,  the  cost  is  still  O(s).  The  wavefront  approach  is  much  more  cost-effective  in  this  event,  although  the  ad- 
vantages of  broadcast  are  greater  for  meshes  [22].  Hence  the  wavefront  approach  challenges,  at  least  with  re- 
gard to  matrix  multiplication,  the  recommendation  in  [2]  that  hardware  support  for  broadcasting  to  subcubes  be 
included  as  a standard  for  hypercubes. 

The  existence  of  algorithms  for  hypercubes  and  meshes  with  O(s)  communication  cost  can  be  deduced  from 
the  development  in  [1],  This  was  noted  by  Johnsson  [6],  who,  however,  specified  no  implementation.  The  attai- 
nability of  O(s)  cost  is  also  noted  in  [16]. 
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In  Section  7 we  note  that  algorithms  oriented  to  a certain  topology  may  be  ported  to  another  by  altering  the 
interface  to  the  network  of  process  arrays.  This  is  an  instance  of  the  development  of  modular  software  for 
message-passing  machines.  It  requires  considerable  caution,  especially  on  a machine  such  as  the  iPSC  where 
messages  are  qualified  by  a type  number  rather  than  the  identity  of  the  sender. 

In  the  same  spirit,  in  Section  6 we  develop  a mathematical  characterization  of  algorithms  for  matrix  multi- 
plication on  arrays  of  nodes,  independent  of  interconnection  topology.  Thus,  for  example,  we  are  able  to  view 
algorithms  for  toroidal  meshes  and  hypercubes  as  special  cases  of  a more  general  data-routing  paradigm. 

Finally,  we  note  in  passing  that  the  traditional  method  of  evaluating  the  efficiency  of  algorithms  does  not 
fully  apply  here.  Typically  matrix  multiplication  would  be  a phase  in  an  overlying  algorithm;  because  of  pipelin- 
ing of  data  between  processes,  the  start  and  end  of  a matrix  multiplication  phase  on  different  nodes  is  asynchro- 
nous. A process  representing  a single  phase  can  only  be  evaluated  meaningfully  in  terms  of  its  functioning  as  a 
node  in  the  network  of  process  arrays,  which  presumably  represents  the  overlying  algorithm.  More  generally,  it 
has  become  clear  that  new  models  need  to  be  developed  to  describe  algorithms  on  distributed-memory  machines; 
these  might,  e.g.,  focus  on  algorithms  as  manipulations  of  distributed  data  structures.  Classical  theories  which 
view  algorithms  as  mappings  from  input  to  output  are  clearly  inadequate  in  this  setting. 
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2.  Node  arrays. 

Here  we  formalize  the  notion  of  the  virtual  node  arrays  mentioned  in  the  previous  section.  Let  S = {0,1, ...  , 
s-1 } and  let  I be  the  identity  matrix  of  order  s.  Throughout  the  paper  the  symbols  s,  S and  I will  be  fixed.  We 
also  define  three  data  types: 

type  Boolean  = element  of  {0,1}; 

Nodelabel  = injective  mapping  from  S2  to  {0,1, ...  , s2-l  }; 

Permutation  = permutation  on  S; 

Letting  T*  denote  transpose  of  T,  a permutation  matrix  T on  S will  refer  to  an  array[0..s-l,0..s-l]  of  Boole- 
an with  T * T*  = I.  We  define  a bijection  Q from  the  set  of  Permutations  to  the  set  of  permutation  matrices  on 
S as  follows:  given  a Permutation  o,  define  a permutation  matrix  T on  S by  T[ij]  = 1 if  j = a(i)  and  0 other- 
wise; then  set  T = Q(a).  If  T = Q(a)  then  T*  = Q(ct-1  ). 

DEFINITION  2.1.  An  adjacency  matrix  of  degree  d is  a symmetric  array[0..s-l,0..s-l]  of  Boolean  with  ex- 
actly d ones  in  each  row  and  column,  or  equivalently  all  row  and  column  sums  equal  to  d. 

□ 

DEFINITION  2.2.  We  represent  node  arrays  by  ordered  pairs: 
type  Nodearray  = (Nodelabel,  adjacency  matrix); 

In  Nodearray  (P.E),  nodes  P(ij , j, ) and  P(i2,  j2)  are  adjacent  if  i j = i2  and  E[j  1,j2]  = 1,  orifjj  = j2  and 
E[i  j , i2  ] = 1. 

□ 

A Nodearray  (P,E)  may  be  interpreted  physically  as  an  s x s array  of  nodes  labeled  by  P;  i.e.  P(i,j)  is  the 
label  of  the  node  in  row  i,  column  j.  Adjacency  means  that  nodes  are  joined  directly  by  a serial  link;  the  sym- 
metry of  E means  the  links  are  bidirectional.  If  E has  degree  d,  then  each  node  is  adjacent  to  exactly  d nodes  in 
each  row  and  column,  and  no  others.  If  a row  or  column  is  viewed  as  an  undirected  graph  then  it  is  regular  of 
degree  d,  with  E determining  its  edges. 

Two  Nodearrays  (Pstan,Ehyp)  and  (Pgc.Ecyc)  with  s = 4 are  illustrated  in  Fig.  2,  along  with  a permutation 
matrix  T.  We  note  that  adjacencies  of  nodes  in  (Pstan,Ehyp)  and  (Pgc,Ecyc)  are  identical;  e.g.  node  9 is  adja- 
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cent  to  nodes  1,8,  11,  13  in  both  Nodearrays.  This  is  not  a coincidence.  In  Fig.  2 we  note  that  if  Pstan  and  Pgc 
are  treated  as  arrays,  Pstan  = T*  * Pgc  * T and  Ehyp  = T*  * Ecyc  * T.  In  general,  if  E and  E are  adjacency  ma- 
trices and  E[i,j]  < E[i j]  for  ij  e S write  EcE. 

DEFINITION  2.3.  Suppose  (P,E)  and  (P,E)  are  Nodearrays,  T is  a permutation  matrix  on  S,  P = T*  * P * 
T,  and  EeT**E*T.  Then  we  write  (P,E) c (P.E)  with  transformation  matrix  T. 

□ 

For  example,  if  (PJE)  and  (P,  E)  are  Nodearrays  and  EcE  then  (P,  E)  c (P,E)  with  transformation  matrix 
I. 

LEMMA  2.1.  If  E and  E are  adjacency  matrices  with  EcE  and  T is  a permutation  matrix  on  S,  then  T*  * 
E*  T c T*  * E * T. 

Proof:  immediate  from  T[ij]  > 0 for  all  ij  e S. 

□ 

DEFINITION  2.4.  If  (P,E)  and  (P  ,E)  are  Nodearrays,  T is  a permutation  matrix  on  S,  P = T*  * P * T,  and 
E = T*  * E * T,  we  say  (P,E)  is  equivalent  to  (P,E)  with  transformation  matrix  T. 

□ 

LEMMA  2.2.  If  (P,E)  and  (P,E)  are  Nodearrays  then  the  following  are  equivalent: 

i.  (P.E)  is  equivalent  to  (P,E)  with  transformation  matrix  T. 

ii.  (P,E)  is  equivalent  to  (P,E)  with  transformation  matrix  T* . 

iii.  (P.E)  £ (P.E)  c (P,E)  with  transformation  matrices  T and  T*,  respectively. 

Proof:  suppose  (i)  holds.  Then  P = T*  * P * T and  E = T*  * E * T.  Thus  P = T*P*T‘andE  = T*E* 
T* . Hence  (ii)  holds.  Similarly  (i)  holds  if  (ii)  holds.  Suppose  (i)  and  (ii)  hold.  Since  E cEwe  have  E c T * 
E * T*  , and  similarly  E c T*  * E * T.  Thus  (iii)  holds.  Finally,  if  (iii)  holds,  then  E c T*  * E * T and  EcT* 
E * T’ . By  Lemma  2.1,  T*E*T‘cE  and  T*  * E * T c E . Hence  (i)  and  (ii)  hold. 

□ 

Lemma  2.2  induces  an  equivalence  relation  on  the  set  of  Nodeairays.  Now  if  (P,E)  and  (P,  E)  are  Nodear- 
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rays  and  (P, E)  £ (P.E),  then  viewing  their  rows  and  columns  as  graphs,  the  graphs  for  (P, E)  are  subgraphs  of 
the  corresponding  graphs  for  (P,E).  More  generally  we  have 

LEMMA  2.3.  Suppose  (P,E)  and  ( P , E ) are  Nodearrays  and  (P.E)  £ (P,E)  with  transformation  matrix 
Q(ct).  Then 

i.  P(ij)  = P(o(i),o(j))  (ij  e S) 

ii.  E[ij]  > E [a(i),a(j)]  (ij  e S) 

iii.  if  nodes  are  adjacent  in  (P,E)  then  they  are  adjacent  in  (P.E). 

Proof:  let  T = Q(c).  Since  P = T*  * P * T we  have  P = T*P*T‘,so 

P(iJ)  = L X T[i,k]  * PM  * TU.n]  (ij  e S) 

i= 0 n =0 

Also  T[i,k]  = 1 iff  k = a(i),  and  T[j,n]  = 1 iff  n = a(j).  This  proves  (i);  (ii)  is  similar.  Now  suppose  P(ij,j| 
) and  P ( i^ , j^ ) are  adjacent  in  (P.E).  Let  ir  = a’1  (i')  and  jr  = a'1  ( j '),  r = 1,2.  Now  if  i'  = i2  and  E [ j ' , j^ ] = 
1,  then  i,  = i2  , and  by  (ii),  E[jj  ,j2]  > E[o(  jj  ),o(  j2)]  = E[jj  ,j2]  = 1.  Thus  E[j,  ,j2]  = 1.  Hence  P(ij , j j ) and 
P(i  2 , j 2 ) are  adjacent  in  (P,E).  The  same  is  true  if  jj  = j2  and  E[ij  ,i2]  = 1.  By  (i),  P(i  r , j r ) = P(i',j');  this 
proves  (iii). 

□ 

COROLLARY  2.1.  If  (PJE)  and  ( P , E ) are  Nodearrays  and  ( P , E)  is  equivalent  to  (P,E)  with  transforma- 
tion matrix  Q(o)  then 

i.  P(i  j)  = P (a(i),o(j))  (i  j € S) 

ii.  E[ij]  = E [a(i),a(j)]  (i j e S) 

iii.  Nodes  are  adjacent  in  (P,E)  iff  they  are  adjacent  in  (P,E). 

Proof:  (i)  is  direct  from  Lemmas  2.2iii  and  2.3i.  By  Lemma  2.2iii  again,  (P,E)  c (P,E)  with  transformation 
matrix  Q(a  ').  Applying  Lemma  2.3ii  to  C1  we  find 


7 


E[i',j']>E[cr1(i'),<r1(n] 


(i\T  e S) 


Writing  V = a(i), j'  = c(j)  in  the  above  gives  E[a(i),a(j)]  ^ E[ij].  Combined  with  Lemma  2.3ii  applied  to 
a this  gives  (ii).  Similarly  (iii)  follows  from  two  applications  of  Lemma  2.3iii. 

□ 

Equivalent  node  arrays  are  representations  of  rearrangements  of  the  same  underlying  physical  network  of 
nodes.  Now  a process  array  procarray  is  a collection  of  processes  {procarray(ij)];  a priori  these  have  no  connec- 
tions to  Nodearrays. 

DEFINITION  2.4.  We  say  process  array  procarray  is  hosted  by  Nodearray  (P,E)  if  for  ij  € S,  procarray(ij) 
is  loaded  to  P(ij). 

□ 

A process  array  may  prefer  to  be  hosted  by  a certain  Nodearray  because  of  alignment  of  data.  For  example, 
a specification  in  Section  1 is  that  Multiply(ij)  receives  A(ij)  from  Asourcefij)  which  is  also  on  P(ij).  This  in- 
terface occurs  with  minimal  communication  cost;  the  restriction  is  that  process  arrays  Asource  and  Multiply 
must  be  hosted  by  the  same  Nodearray.  The  same  is  true  for  Bsource  and  Multiply.  The  alignment  of  { A(i j)} 
and  { B(i  j) } upon  receipt  by  Multiply  is  inherited  from  the  topology  of  the  common  Nodearray.  Presumably  this 
topology  is  geared  to  Asource  and  Bsource.  Multiply  may  wish  to  have  a private  topology;  i.e.  it  may  wish  to 
be  hosted  by  its  own  Nodearray.  This  may  be  effected  physically  by  a realignment  of  the  { A(i j)}  and  (B(ij)} 
using  a wavefront  equivalent  to  a transformation  matrix.  In  deference  to  Csink,  Multiply  can  realign  {C(i j)}  be- 
fore passing  them  along,  using  the  inverse  of  the  previous  transformation  matrix.  In  Section  7 we  show  how 
this  can  be  done  transparently  by  inserting  filters  between  communicating  process  arrays  hosted  by  different  No- 
dearrays. 
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3.  Permutations,  adjacency  and  wavefronts. 

The  characterization  of  adjacency  in  Nodearray  (P.E)  by  the  definition  of  E is  local;  i.e.  it  tells  us  where  we 
can  send  a block  from  a node  with  minimal  communication  cost.  Of  greater  interest  is  a characterization  of  how 
we  can  interchange  blocks  among  nodes  in  a row  or  column  concurrently;  this  will  permit  us  to  construct  wave- 
fronts  flowing  through  rows,  columns  or  both.  Since  we  do  not  want  to  send  two  blocks  to  one  node,  the  basic 
unit  for  a step  of  a wavefront  is  the  permutation.  A wavefront  is  thus  represented  by  a composition  of  permuta- 
tions. Now  we  define  a family  of  generic  data  types  by 

type  Permarray[t]  = array[0..t-l]  of  Permutation; 

Generic  refers  to  the  parameter  t.  This  definition  is  in  deference  to  static  languages  such  as  C which  require 
the  array  bound  t to  be  specified  at  load  time,  thereby  instantiating  the  data  type. 

If  7t  is  a Permarray[t]  then  the  components  of  tt  are  designated  as  n0  , ...  , , where  rtz  is  a Permutation. 

Let  ;t be  the  Permarray[t]  defined  by 

(jf1)I  = (7Clpl.,)'1  (0  < z < t) 

For  the  same  n let  L*  be  an  associated  PermarTay[t+l]  defined  by 

L2  = n0°  - ° 71 2-1  (1  < Z < t) 

L o G)  = j G e S) 


It  follows  that 

L* ' =(L?)-1 

Also  define  the  ordinary  array  LK  : {1,  ...  ,t)  x S ->  S by 
L*  [z  j]  = L*  (j)  (1  < z < t,  j e S) 

We  will  note  later  that  compositions  of  the  form  L * describe  data  flow  in  wavefronts.  We  will  wish  to  re- 
tain the  option  of  disabling  the  flow  at  a node  during  a step;  thus  we  introduce  another  family  of  generic  types: 

type  Mask[t]  = array[0..t-l,0..s-l]  of  Boolean; 
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A Mask  is  an  ordinary  array  when  s and  t are  instantiated  at  load  time.  Full  enabling  of  nodes  is  represent- 
ed by  Jt  , a fixed  Mask[t]  defined  by  Jt[zj]  = 1 for  0 < z < t , j e S.  If  n is  a Permarray[t]  and  x is  a Mask[t] 
let  W**  be  an  associated  Permarray[s]  defined  by 


W 


= K 


-*[0.1] 


o n 


-Zh-Ul 

m 


(ie  S) 


where  for  any  permutation  a,  a°(j)  = j-  For  the  same  x and  n define  ordinary  array  W,  _ : S2  — > S by 
Wxx[zj]  =W^0)  (zje  S) 

Now  in  addition  to  their  later  use  in  wavefronts,  we  can  use  Permarrays  to  characterize  adjacency  in  No- 
dearrays.  First  we  need  the  following: 


DEFINITION  3.1.  Suppose  7t  is  a Permarray[t]  and  for  each  i € S,  7tz  (i)  * 7tz.(i)  for  z * z'.  Then  we  say  n 
is  discordant. 

□ 

The  study  of  discordant  permutations  in  combinatorial  and  statistical  settings  (e.g.  [18])  predates  their  ap- 
pearance here  by  nearly  three  centuries  [13].  For  example,  a well-known  fact  is  that  the  probability  that  an  arbi- 
trary Permarray[2]  is  discordant  is  very  close  to  1/e,  e = 2.718...,  essentially  independent  of  s. 

Discordant  permutations  are  fundamental  to  minimization  of  communication  cost  in  data-routing  on 
distributed-memory  machines.  They  represent  routings  of  data  among  a collection  of  positions  so  that  data  nev- 
er visits  the  same  position  twice  (cf.  Remark  5.1).  Moreover,  they  provide  an  orthogonal  characterization  of  ad- 
jacency for  Nodearrays: 

DEFINITION  3.2.  Suppose  rc  is  a discordant  Permarray[t].  With  Q as  in  Section  2,  let  INC(tc)  be  the 
array [0..s-l,0..s-l]  given  by 

INC(n)  = '£  Q(rtz) 

i=0 


Then  we  say  INC(7t)  is  the  incidence  matrix  of  n. 


□ 


LEMMA  3.1.  Suppose  n is  a discordant  Permarray[t]  and  E is  the  incidence  matrix  of  n.  Then 
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i.  E[ij]  is  the  number  of  z satisfying  j = 7tz(i). 


ii.  E[ij]  is  the  number  of  z satisfying  i = 7tz'(j). 

iii.  E is  an  array  of  Boolean. 

iv.  E has  exactly  t ones  in  each  row  and  column. 

v.  if  E is  symmetric  then  E is  an  adjacency  matrix  of  degree  t. 

Proof:  let  Tz  = Q(nz)  for  0 < z < L Then 

E[i  j]  = X T,  [i  j]  (i  j e S) 

1=0 

Now  Tz  [ij]  = 1 iff  j = rcz(i);  this  proves  (i).  Also  j = tcz  (i)  iff  i = Tt'^);  this  proves  (ii).  Now  if  j = nz  (i)  = 
7tz-(i),  since  tc  is  discordant  we  have  z = z\  By  (i),  E[ij]  < 1.  Since  E[ij]  > 0,  (iii)  follows.  Also 

t- 1 t- 1 t- 1 r-l  r-l 

Z E[ij]  = Z Z Tz[ij]  = £ Tz[i,rtz(i)]  = Z 1 = t 

j= o i^)  j=a  i=o  z =o 


Similarly 

Z E[ij]  = Z Tz  [rcz  (j) J]  = t 

i=0  z=0 

This  proves  (iv);  (v)  is  immediate. 

□ 


LEMMA  3.2.  Suppose  E is  an  adjacency  matrix  of  degree  L Then  there  exists  n,  a discordant  Permarray[t], 
with  incidence  matrix  E. 


Proof:  we  note  that  E is  Boolean  and  has  all  row  and  column  sums  equal  to  t.  By  ([19]  p.  57)  there  exist 
permutation  matrices  T0  , ...  , T on  S with 

r-l 

E=2T, 

2=0 

Define  n to  be  the  Permarrayft]  with  tcz  = Q1  (Tz)  for  0 < z < L Since  E is  Boolean, 
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E[ij]  = XTJij]<l 

i=0 

If  7tk(i)  = rck.(i)  for  some  ijc,k'  with  k * k',  since  Tz  = Q(7tz)  we  have 
E[i,7Ck  (i)]  > Tk[i,rck(i)]  + Tk- [i,Kk- (i)]  = 2 
This  is  a contradiction;  hence  7t  is  discordant.  Also 

INC(rt)  = £ Q(rcz)  = E 


DEFINITION  3.3.  Suppose  n is  a discordant  Permarray[t]  with  E = INC(rc).  Then  we  say  n is  a basis  for  E. 

□ 

Lemma  3.2  says  that  bases  exist  for  adjacency  matrices;  but  they  are  not  unique. 

LEMMA  3.3.  Suppose  n is  a discordant  Permarray[t],  (P,E)  and  (P.E)  are  Nodearrays,  (P.E)  is  equivalent 
to  (P.E)  with  transformation  matrix  T,  and  n is  a basis  for  E.  Then 

i.  in  (P,E)  the  nodes  adjacent  to  P(i  j)  are 

{P(i,7rz(j))}0Sz<t  u {P(7cz(i)J)}0Sz<l 

ii.  if  n is  the  Permarray[t]  defined  by  n = Q'1  (T*  * Q(rrz)  * T)  then  rt  is  a basis  for  E . 

Proof:  we  have  E(j,  jO  = 1 iff  there  exists  z with  j'  = 7r z (j);  (i)  follows  from  definition.  For  (ii)  we  have 

E = !Q(nz) 

1=0 

E=  lT‘*Q(nz)*T=  £QUz) 

2=0  2 =0 

Also  E : S2  —>  {0,1}  . Thus  n is  discordant  and  E = INC(  it). 

□ 

Lemma  3.3i  shows  that  the  adjacency  structure  of  Nodearray  (P,E)  is  characterized  by  any  basis  for  E. 
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Lemma  3.3ii  shows  that  the  homomorphic  image  of  a basis  is  a basis  for  an  equivalent  node  array.  Lemma  3.2 
guarantees  that  a basis  for  E can  always  be  found.  Now  we  need  to  interconnect  these  facts  with  wavefronts.  We 
noted  earlier  that  the  latter  may  be  characterized  by  compositions  of  permutations.  However,  these  will  only  be 
efficient  with  regard  to  communication  cost  if  they  are  related  to  basis  elements.  Specifically,  each  permutation 
should  be  achievable  in  one  parallel  step  involving  only  communication  between  neighboring  nodes. 

DEFINITION  3.4.  Call  a Permutation  a realizable  in  Nodearray  (P.E)  if  for  i e S,  E(i,o(i))  = 1 or  a(i)  = i 
(in  the  latter  case  i is  a fixed  point  of  a).  Equivalently,  a is  realizable  in  (P£)  iff  P(i  j)  is  adjacent  or  equal  to 
P(i,o(j))  and  P(a(i)j)  for  ij  e S. 

□ 

Compositions  of  realizable  permutations  represent  dataflow  in  a row  or  column  with  communication  only 
between  adjacent  nodes  at  each  step. 

DEFINITION  3.5.  If  n is  a Permarray[t]  and  (P,E)  is  a Nodearray,  we  say  rc  is  realizable  in  (P,E)  iff  nz  is 
realizable  in  (PJ~)  for  0 < z < t. 

□ 

If  n is  a realizable  Permarray[t]  and  L*  is  discordant,  the  latter  represents  a series  of  data  routings  yielding 
minimal  communication  cost:  t nodes  are  visited  by  a datum  in  t steps,  with  each  step  involving  only  transfers 
between  adjacent  nodes.  As  we  note  later  (specifically,  during  the  alignment  phase  of  matrix  multiplication), 
discordancy  is  sometimes  too  strong.  Fixed  points  permit  masking  in  some  steps.  Other  violations  of  discordancy 
may  be  indications  of  nonoptimality  in  algorithms. 

LEMMA  3.4.  Suppose  (P,E)  is  a Nodearray  and  E has  degree  L Then 

i.  If  rt  is  a discordant  Permarray[t]  with  no  fixed  points,  INC(7t)  = E iff  n is  realizable  in  (P.E). 

ii.  If  n is  a discordant  Permarray[tl  with  no  fixed  points  and  n is  realizable  in  (P,E),  we  can  find 
{rt2 ) t'sz<t  so  that  the  extension  7t  is  a discordant  Permarray[t]  with  INC(7t)  = E. 

Proof:  exercise. 

□ 
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4.  Programming  and  the  network  of  process  arrays. 

Here  we  describe  our  assumptions  concerning  processes  and  the  communications  subsystem  which  we  as- 
sume when  writing  pseudocode.  Also  we  describe  the  structure  and  interfacing  of  the  process  arrays  Asource, 
Bsource,  Multiply  and  Csink.  More  generally  we  address  some  issues  relevant  to  writing  modular  software  for 
message-passing  machines  such  as  the  iPSC. 

We  assume  that  a process  definition  consists  of  a collection  of  definitions  for  procedures,  types  and  symbol- 
ic constants,  with  one  procedure  designated  as  main.  We  assume  that  a procedures’s  referencing  environment 
consists  of  the  definitions  which  are  part  of  the  definition  of  the  process  containing  it,  plus  locally  defined  ob- 
jects. The  default  parameter  transmission  mode  will  be  call  by  value  (i.e.  in  only).  If  call  by  result  is  used  (out 
only)  it  will  be  explicitly  indicated  by  the  keyword  "out"  in  a procedure  heading. 

We  use  < > to  refer  to  generic  or  unspecific  code;  e.g.  <Element>  might  be  real,  double  precision  etc.  If 
buf  is  a pointer,  *buf  refers  to  the  contents  of  the  area  buf  points  to.  If  arr  is  an  array,  arr  = a means  assigning  a 
to  each  element  of  arr. 

We  assume  that  at  load  time  a process  is  supplied  with  a user-defined  integer  identifier,  with  unique 
identifiers  assigned  to  processes  on  a node.  Furthermore  we  assume  that  all  processes  in  a process  array  are  as- 
signed the  same  identifier;  then  we  define  the  identifier  of  the  process  array  to  be  this  common  integer. 

If  a process  array  procarray  is  hosted  by  Nodearray  (P.E),  we  assume  that  procarray(ij)  may  define  func- 
tions Row(  ) and  Col(  ) which  yield  i and  j,  respectively.  Presumably  these  are  synthesized  from  a system  call 
yielding  P(i  j). 

We  define  two  generic  types: 

type  Block  = array [0..m-l,0..m-l]  of  <Element>; 

Blockpointer  = pointer  to  Block; 

In  the  above,  <Element>  may  be  real  etc.  We  also  define  standard  procedures 

procedure  Null(x,y:  Block; 

a:  Blockpointer); 

begin 
end  Null; 
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procedure  Ccomp(x,y:  Block; 


a:  Blockpointer); 

begin 

<a=a+x*y> 
end  Ccomp; 

The  code  for  Ccomp  is  unspecific;  it  computes  the  product  of  the  blocks  x and  y in  standard  sequential 
fashion. 

Each  message  sent  by  a process  will  consist  of  one  Block.  A send  by  a process  is  a system  call  of  the 

form 

send(type,  buffer  .identifier,  node); 

where  type  is  a user-defined  integer  qualifier  for  messages,  buffer  is  a Blockpointer,  identifier  is  the  identifier  of 
the  process  to  which  *buffer  is  to  be  transmitted,  and  node  is  the  label  of  the  node  on  which  the  latter  process  is 
located. 

A receive  by  a process  is  a system  call 
recv(type, buffer); 

where  type  and  buffer  are  as  above.  This  fills  *buffer  with  the  first  block  sent  to  the  process  with  a matching 
type.  This  follows  the  format  of  the  Intel  iPSC;  in  particular  the  source  of  a message  is  not  explicitly  identified. 
We  define  a unit  block  transfer  by  a process  to  be  the  issuing  of  a send  to  an  adjacent  node. 

LEMMA  4.1.  If  n is  a Permarray[t]  realizable  in  (P,E)  and  process  array  procarray  is  hosted  by  (P,E)  then 
procarrayfi  j)  incurs  one  unit  block  transfer  when  it  issues  a call  of  the  form 

send(type,buffer,identifierj5(i,rtz(j));  (0  < z < t) 


or 


send(type,buffer,identifierj>(7tz(i),j);  (0  <=  z < t) 

Proof:  direct  from  definition. 
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□ 


In  the  following  we  assume  Nodearray  (P0,E0)  is  fixed. 

DEFINITION  4.1.  We  assume  Csink  is  hosted  by  (P0,E0)  and  has  identifier  0.  Also  assume  that  each 
Csink(i  j)  includes  in  its  code 

var  CL:  Blockpointer; 
recv(2,CL); 

Suppose  Csink(i  j)  issues  no  receives  of  type  2 prior  to  the  above,  no  sends  with  identifier  1 and  no  sends 
with  identifier  0 and  type  2. 

□ 


DEFINITION  4.2.  We  assume  Asource  is  hosted  by  (P0,E0)  and  has  identifier  > 2.  Assume  each 
Asource(i  j)  includes  in  its  code 

var  AL:  Blockpointer; 

ij:  integer, 
i = Row( ); 
j = Col( ); 
send(0,AL,lJ>0(ij)); 

Assume  Asource(ij)  includes  no  other  assignments  to  i or  j,  no  other  sends  with  identifier  1,  and  no  sends 
with  identifier  0 and  type  2.  Suppose  also  that  when  it  issues  the  above  send,  *AL  = A(i,j). 

□ 

DEFINITION  4.3.  We  assume  Bsource  is  hosted  by  (Pq.Hq)  and  has  identifier  > 2.  Assume  each 
Bsource(i  j)  includes  in  its  code: 

var  BL:  Blockpointer; 

ij:  integer, 
i = Row( ); 
j = Col( ); 


16 


send(l,BL,l,P0(i,j)); 


Assume  Bsource(ij)  includes  no  other  assignments  to  i or  j,  no  other  sends  with  identifier  1,  and  no  sends 
with  identifier  0 and  type  2.  Suppose  also  that  when  it  issues  the  above  send,  *BL  = B(i  j). 


□ 


DEFINITION  4.4.  Suppose  f has  specification 


procedure  f(al,bl:  Blockpointer; 

cl:  out  Blockpointer; 
p:  Nodelabel); 


Suppose  all  sends  issued  by  f have  identifier  1.  Then  write  f e PREMUL(s.m). 
sends  and  receives  issued  by  f have  a < type  < b.  Then  write  Typerange(f)  = [a,b). 


Suppose  further  that  all 

□ 


Implicitly  we  assume  that  if  Typerange(f)  = [a.b)  then  a and  b-1  are  used  as  types,  so  that  Typerange  is 
well-defined. 

DEFINITION  4.5.  Suppose  f e PREMUL(s,m),  Typerange(f)  = [a,b),  P is  a Nodelabel,  and  for  each  ij  e S, 
P(i  j)  has  some  process  proc(i  j)  with  identifier  1 which  makes  a call  of  the  form  f(AL,BL,CL,P).  Suppose  at  the 
time  of  this  call  on  P(ij)  we  have  *AL  = A(ij)  and  *BL  = B(i,j).  Suppose  any  send  with  identifier  1,  issued  by 
any  process,  with  the  exception  of  sends  issued  within  the  single  call  to  f in  a proc(i  j),  has  type  < a or  > b.  Sup- 
pose any  receive  issued  by  a proc(i  j),  with  the  exception  of  receives  issued  within  its  call  to  f,  has  type  < a or  > 
b.  Then  let  Out(f;P,i  j)  be  the  value  of  *CL  at  the  time  of  exit  from  the  call  to  f on  P(i  j). 

□ 

LEMMA  4.2.  In  Definition  4.5,  the  value  of  Out  is  independent  of  the  (proc(ij)). 

Proof:  the  values  of  *al,  *bl  and  p received  by  an  f are  fixed  by  the  hypothesis  of  Definition  4.5,  and  *cl  is 
out  only.  Hence  the  data  received  by  an  f in  the  form  of  parameters  does  not  depend  on  the  calling  process. 
Now  proc(ij)  is  the  unique  process  on  a p(ij)  with  identifier  1.  Hence  all  sends  in  a call  to  an  f are  to  some 
proc(i,j).  Any  receives  in  a proc(i  j)  outside  its  call  to  f have  type  < a or  > b,  and  cannot  be  matched  with  a send 
from  an  f.  Thus  all  sends  from  within  a call  to  an  f are  matched  with  a receive  in  a call  to  an  f.  Conversely, 
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sends  not  issued  from  within  an  f,  with  identifier  1,  have  type  < a or  > b,  and  cannot  be  matched  with  a receive 
within  an  f.  Hence  all  receives  within  an  f are  matched  by  sends  from  within  an  f.  Thus  a copy  of  f receives 
data  by  message  transmission  only  from  copies  of  f;  and  all  copies  receive  fixed  data  via  parameter  transmission. 
Hence  the  value  of  *cl  on  exit  is  independent  of  the  calling  process. 

□ 

DEFINITION  4.6.  Suppose  f € PREMUL(s,m).  Define  g = NETLAYER(f)  by 


procedure  g(p:  Nodelabel); 
var  al,bl,cl:  Blockpointer; 

i j:  integer; 
begin 

i = Row( ); 
j = Col( ); 
recv(O^tl); 
recv(l.bl); 
f(al,bl,cl,p); 
send(2,cl,0,p(ij)) 
end  g; 


□ 


As  their  names  suggest,  PREMUL  represents  potential  kernels  for  multiply  routines  and  NETLAYER  sur- 
rounds these  with  an  interface  to  the  network  of  process  arrays.  Elements  of  PREMUL  thus  communicate  among 
themselves  exclusively.  The  following  verifies  that  the  interface  to  the  network  functions  properly. 

LEMMA  4.3.  Suppose  f e PREMUL(s,m),  Typerange(f)  = [a,b),  a > 3 , and  g = NETLAYER (0-  Suppose 
process  array  procarray  is  hosted  by  (P0,E0)  and  has  identifier  1.  Suppose  procarray(ij)  has  the  form 

<definitions> 
procedure  main(  ); 

var  P0  : Node  label; 
begin 
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<define  P0> 

g(P0) 

end  main. 

Suppose  procarray,  Asource,  Bsource  and  Csink  are  the  only  processes  loaded.  Then 

i.  procarray(i j)  receives  A(i j)  from  Asource(ij)  and  B(ij)  from  Bsource(ij). 

ii.  After  Csink(i  j)  executes  recv(2,CL)  we  have  *CL  = Out(f;P0  4 j). 

Proof:  f issues  no  sends  of  type  0,  so  neither  does  g,  so  neither  does  procarray.  Referring  to  Definitions  4.1 
- 4.3,  Bsource  issues  no  sends  with  identifier  1 and  type  0;  Csink  issues  no  sends  with  identifier  1.  Referring  to 
the  format  of  Definition  4.6,  the  recv(0,al)  in  the  call  to  g in  procarray(i j)  can  only  be  matched  by  the  send  in 
Asource(ij)  with  identifier  1.  Conversely,  the  latter  send  cannot  be  matched  with  a receive  in  Asource,  Bsource 
or  Csink  since  these  have  identifiers  * 1.  Also,  f has  no  receives  of  type  0.  Hence  the  recv(0,al)  in  g in 
procarray(ij)  is  matched  by  the  send  in  Asource(ij)  with  identifier  1.  Thus  the  former  yields  *al  = A(ij).  Simi- 
larly the  recv(l.bl)  in  g in  procarray(ij)  gives  *bl  = B(ij).  This  proves  (i). 

Now  in  Definition  4.5  take  p = P0  ; note  procarray(ij)  makes  a call  f(al,bl,cl,p)  via  its  call  to  g.  By  (i),  at 
the  time  of  this  call  we  have  *al  = A(ij)  and  *bl  = B(ij).  Sends  by  Asource  and  Bsource  with  identifier  1 have 
types  < a (in  fact  0 or  1);  Csink  issues  no  sends  with  identifier  1.  Also,  sends  by  procarray,  outside  of  f,  have 
type  < a (in  fact  = 2).  Hence  sends  not  in  a call  to  f,  with  identifier  1,  have  type  < a.  Receives  issued  by  procar- 
ray, outside  of  f,  have  type  < a (in  fact  0 or  1).  Hence  the  conditions  of  Definition  4.5  are  satisfied,  and  the  call 
to  f within  the  call  to  g in  procarray (i,j)  gives 

*cl  = Out(f;P0,ij) 

Now  Asource,  Bsource  and  Csink  issue  no  sends  with  identifier  0 and  type  2.  Also,  f issues  no  sends  of 
type  2.  Thus  the  recv(2,CL)  in  Csink(ij)  can  be  matched  only  by  the  send  in  g in  procarray (i  j).  Conversely,  the 
send  in  g has  identifier  0,  and  hence  cannot  be  matched  with  a receive  in  Asource,  Bsource,  or  procarray.  Thus 
it  matches  the  receive  in  Csink(i  j).  This  proves  (ii). 

□ 
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5.  A wavefront. 


We  develop  code  in  this  section  for  what  might  be  termed  a symmetric  dual  masked  wavefront  flowing 
through  a Nodearray.  Dual  means  data  flows  horizontally  and  vertically,  and  symmetric  means  the  two  flows 
have  similar  structures.  Masked  means  that  only  designated  nodes  are  enabled  in  a given  step  of  one  of  the 
flows.  If  the  mask  is  set  to  1 then  the  flows  are  homogeneous,  i.e.  data  flow  within  a row  (respectively  column) 
is  the  same  for  all  rows  (respectively  columns).  Such  a wavefront  can  be  coded  as 

procedure  Wave(horbuf,verbuf,accum:  Blockpointer; 
p:  Nodelabel; 

compute:  procedure(hbl,vbl:  Block; 

acc:  Blockpointer); 
base,hostid,steps:  integer; 
k : Permarray [steps]; 

X : Mask[steps]); 
var  i j,z:  integer; 
begin 

i = row(  ); 
j = col( ); 

for  z = 0 to  steps- 1 do 
begin 

if  x[z.»]  = 1 then 
begin 

send(z+base,horbuf,hostid,p(i,7tz(j))); 

recv(z+base,horbuf) 

end; 

if  xfzj]  = 1 then 
begin 

send(z+steps+base,verbuf,hostid,p(rcz(i)j)); 

recv(z-t-steps+base,verbuf) 
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end; 


compute(*horbuf,*verbuf,accum) 

end 

end  Wave; 

In  the  above,  compute  represents  activity  at  a node  when  data  has  arrived;  accum  stores  the  result;  steps  is 
the  number  of  steps  comprising  the  wave.  The  flow  of  data  is  described  by  compositions  of  Permutations,  name- 
ly the  components  of  n.  The  use  of  z+base  and  z+steps+base  as  types  ensures  that  messages  are  received  in 
correct  order. 

DEFINITION  5.1.  Suppose  P is  a Nodelabel  and  a process  on  P(ij)  issues  a call  of  the  form 
Wave(x,y,accum,P, compute, b,hostid,steps,:t,x); 

Then  this  call  defines  functions  h and  v where  for  0 < r < steps,  h(r;i  j)  and  v(r;i  j)  are  the  values  of  *hor- 
buf  and  *verbuf  respectively  after  r iterations  of  the  loop  in  Wave  (these  definitions  are  strictly  local  to  one  call 
on  one  node). 

□ 

LEMMA  5.1.  Suppose  P is  a Nodelabel.  Suppose  that  for  each  ij  € S,  P(ij)  has  a process  proc(ij)  with 
identifier  1 which  includes  a declaration 

var  X,Y:  Blockpointer, 

and  a call 


Wave(X,Y,  Accum  ,P  .Compute  ,b,  1 ,t,7t  J t ); 

Suppose  no  sends  with  identifier  1 and  b < type  < b+2t  are  issued  by  any  process  except  for  sends  issued 
within  the  above  call  to  Wave  by  a proc(ij).  Suppose  no  proc(ij)  issues  receives  with  b < type  < b+2t  except 
within  its  call  to  Wave.  Then  for  the  call  to  Wave  on  P(i  j): 
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i.  h(r+l;ij)  = hfowr;^)) 


(0  < r < t) 


ii.  v(r+l;ij)  = v(r;rt;1(i)j)  (0  < r < t) 

iii.  h(r;ij)  = h(0;i,L*(j))  (0  < r < t) 

iv.  v(r;ij)  = v(0;L*(i)j)  (0  < r < t) 

v.  the  executions  of  Compute  are  equivalent  to: 

for  z = 0 to  t-1  do  Compute(h(0;i,L  *+1  (j)),v(0;L*+1  (i)  j),Accum); 

Proof:  after  r iterations  of  the  loop  in  Wave  on  any  node,  0 < r < t,  we  have  z = r in  the  next  pass  through 
the  loop.  In  the  call  to  Wave  on  P(i  j)  the  first  receive  is 

recv(r+b,horbuf); 

It  is  issued  from  P(ij)  by  proc(ij)  which  has  identifier  1.  Since  b < r+b  < b+t,  by  hypothesis  it  can  only  be 
matched  by  a send  from  some  Wave.  Since  second  sends  in  Waves  have  type  > b+t,  the  matching  send  must  be 
the  first.  The  latter  will  be  issued  from  some  P(i'j')  and  will  have  the  form 

send(z  '+b,horbuf,  1 ,P(i  \n  z.  (j '))); 

For  the  receive  and  send  to  match,  necessarily  z'  = r , \ = i and  nz.  (jO  = j;  the  latter  gives  j'  = n ra  (j) . 
Now  proc(ij)  is  the  unique  process  on  P(ij)  with  identifier  1.  Also,  since  b < z+b  < b+2t,  by  hypothesis 
proc(ij)  cannot  receive  the  above  send  outside  its  call  to  Wave.  Hence  the  above  send  and  receive  do  match. 
Now  this  send  occurs  after  r iterations  of  Wave  on  PO'jO;  by  Definition  5.1,  the  value  of  *horbuf  in  the  send  is 
h(r;i'j0-  Then  upon  completion  of  the  first  receive  in  Wave  on  P(ij)  we  have  *horbuf  = h(r;ij).  Now  verbuf 
points  to  a different  area;  hence  the  second  receive  in  Wave  on  P(i  j)  does  not  alter  the  above,  which  becomes 
the  value  of  *horbuf  on  P(ij)  after  r+1  iterations  of  the  loop  in  Wave.  This  proves  (i);  (ii)  is  similar. 

Now  (iii)  is  true  for  r = 0;  assume  true  for  some  r < t.  By  (i), 

h(r+l;ij)  = h(0;i,(L*  o <)()))  = htf);^  (j)) 

This  proves  (iii)  by  induction;  (iv)  is  similar.  Also,  the  sequence  of  Computes  is  given  by 
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for  z = 0 to  t-1  do  Compute(h(z+l;ij),v(z+l;ij),Accum); 


Combined  with  (iii)  and  (iv),  (v)  is  an  immediate  consequence. 

□ 

LEMMA  5.2.  Suppose  P is  a Nodelabel.  Suppose  for  each  ij  e S,  P(ij)  has  a process  proc(ij)  with 
identifier  1 which  includes 

var  X:  Blockpointer; 

Wave(X,X  jiullaccum  J\Null,b,  1 ,t,rt  J , ); 

Suppose  that  no  sends  with  identifier  1 and  b < type  < b+2t  are  issued  by  any  process,  except  by  a proc(ij) 
within  its  call  to  Wave.  Suppose  no  proc(ij)  issues  receives  with  b < type  < b+2t  except  in  its  call  to  Wave. 
Then  for  the  call  to  Wave  on  P(i  j)  we  have  h = v and 

h(t;ij)  = h(0;L*(i),L*(j))- 

Proof:  horbuf  and  verbuf  point  to  the  same  area,  so  h = v.  Now  in  the  pass  through  the  loop  in  Wave  with 
z = r,  the  second  receive  is  to  the  same  area  as  the  first.  Given  ij  e S let  V = it'*  (i)  and  j'  = it'*  (j)-  Now  on 
P(i'jO,  the  first  send  is 

send(r+b,horbuf,l  ,P(i ' j); 

In  the  above,  *horbuf  = h(r;i'j').  By  hypothesis  this  send  must  be  matched  by  a receive  within  the  Wave 
on  P(i'j).  Since  b <=  r+b  < b+t  it  must  be  matched  by  the  first  receive.  The  latter  is 

recv(r+b,horbuf); 

By  hypothesis,  only  the  above  send  can  match  this  receive.  This  transfers  h(r,i'jO  to  horbuf  = verbuf  on  P(i'j). 
Now  the  second  send  on  the  latter  is 

send(r+t+b, verbuf,  1 ,P(i  j)); 

This  must  be  matched  with  a receive  within  the  Wave  on  P(i j).  Since  first  receives  have  type  < b+t,  the  above 
must  be  matched  by  the  second  receive  on  P(i,j),  namely 
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recv(r+t+b,verbuf); 


By  hypothesis  this  receive  can  only  be  matched  by  the  above  send.  This  transfers  h(r;i'j')  to  horbuf  = verbuf 
on  P(ij).  Hence 

h(r+l;i  j)  = hfoTt^O),*;^))  (0  < r < t) 

As  in  Lemma  5.1,  it  follows  by  induction  that 

h(r;ij)  = h(0;L*(i),L*(j))  (0  < r < t) 

The  result  follows. 

□ 

COROLLARY  5.1.  With  the  hypothesis  of  Lemma  5.2,  the  action  of  Wave  on  P(ij)  is  equivalent  to 

send(b,X,l,P(L?(i),L*G)); 

Proof:  immediate  from  Lemma  5.2. 

□ 

LEMMA  5.3.  Suppose  P is  a Nodelabel.  Suppose  for  each  ij  e S,  P(ij)  has  a process  proc(ij)  with 
identifier  1 which  contains 

var  X,Y:  Blockpointer; 

Wave(X,YjiullaccumJ>,Null,b,l,t,K,x); 

Suppose  no  sends  with  identifier  1 and  b < type  < b+2t  are  Issued  by  any  process,  except  by  a proc(ij) 
within  its  call  to  Wave.  Suppose  no  proc(ij)  issues  receives  with  b < type  < b+2t  except  in  Wave.  Then  for  the 
call  to  Wave  on  P(i  j)  we  have 

h(t;ij)  = h(0;i,WxJij]) 
v(t;ij)  = v(0;Wz„U4]J) 

Proof:  similar  to  Lemma  5.1.  After  r iterations  of  the  loop  we  take  into  account  whether  P(i,j)  is  enabled 
horizontally  or  vertically.  This  yields 
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h(r+l;ij)  = h(r;i,7tr*Ir,ll(j)) 

and  similarly  for  v(r+l;ij).  Analogues  of  the  results  in  Lemma  5.1  are  immediate. 

□ 

COROLLARY  5.2.  With  the  hypothesis  of  Lemma  5.3,  the  action  of  Wave  on  P(ij)  is  equivalent  to 
send(b^,U>(i,Wx1t[ij])); 
send(b+l,Y  ,Hostid,P(Wx  ^ [j,i]  j)); 

Proof:  immediate  from  Lemma  5.3. 

□ 

REMARK  5.1.  If  n is  realizable,  a Wave(...,7t,x)  routes  all  messages  explicitly,  i.e.  sending  blocks  between 
adjacent  nodes  at  each  step.  However,  a store-and-forward  system  such  as  the  iPSC  provides  virtual  circuit  ser- 
vice, i.e.  implicit  routing  by  the  system  of  messages  between  non-adjacent  nodes.  On  such  a machine  the  results 
of  Corollaries  5.1  and  5.2  can  be  used  to  abrogate  the  sequence  of  messages  used  in  Wave  in  Lemmas  5.2  and 
5.3.  Implicit  routing  may  reduce  overhead,  although  the  cost  is  the  same  if  measured  by  unit  block  transfers.  On 
the  other  hand,  our  wavefront  routes  data  with  maximum  parallelism  at  each  step  because  of  the  use  of  permuta- 
tions; this  holds  "conflicts"  at  nodes  to  a minimum.  The  latter  is  in  reference  to  the  buffering  necessary  when 
several  messages  must  pass  through  a node  concurrently.  In  fact,  if  our  wavefront  were  synchronous  (i.e.  systol- 
ic), system  buffering  would  be  unnecessary.  To  compete  with  implicit  routing,  which  probably  uses  paths  of 
minimum  length,  the  wavefront  should  do  the  same.  In  particular  L*  should  be  discordant,  except  for  fixed 
points  produced  by  masking;  otherwise  some  paths  will  contain  cycles.  Even  with  the  number  of  steps  (t  if  n is 
a Permarray[t])  minimal,  however,  it  may  be  advantageous  to  keep  both  implicit  and  explicit  routing  as  options 
in  store-and-forward  systems. 

□ 

LEMMA  5.4  Suppose  n is  a Permarray[t]  which  is  realizable  in  Nodearray  (P,E).  Suppose  a process  on 
P(i,j)  makes  a call  of  the  form 

Wave(x,y,acc,P,  compute,  b,hostid,t,7t,x); 

Then  the  process  will  incur  at  most  2n  unit  block  transfers  in  executing  this  call,  where  n is  the  maximum 
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column  sum  of  X-  If  X = Jt  lhen  exactly  2t  unit  block  transfers  are  incurred. 


Proof:  direct  from  Lemma  4.1. 
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6.  Matrix  multiplication. 

We  show  here  that  the  search  for  a process  array  Multiply  meeting  the  specifications  of  Section  1 can  be 
reduced  to  the  search  for  certain  ordered  triples.  The  main  result  will  show  that  a suitable  class  of  such  triples 
can  be  characterized  in  purely  mathematical  terms.  Examples  are  given  in  Sections  8,  9 and  11.  Having 
identified  an  instance  of  this  class,  it  is  straightforward  to  turn  it  into  a process  array  by  adding  two  layers:  one 
to  interface  with  the  network  of  process  arrays,  and  the  other  to  interface  with  the  programming  environment 
The  latter  permit  development  of  multiplication  modules  which  can  be  used  with  different  algorithms  and  ported 
between  machines,  respectively.  Furthermore  we  can  change  topology:  in  Section  7 we  will  show  that  once  we 
have  located  a triple  for  a given  Nodearray  we  can  in  effect  port  it  to  other  Nodearrays  by  adding  another  outer 
layer.  In  Section  12  we  give  an  example. 

For  s and  m as  in  Section  1,  t an  integer,  P a Nodelabel,  n a Permarrayft],  y a Permarray[s]  and  x a 
Mask[t]  define  a generic  process  Mulproc(s,m,t,P,7t,y,x)  via  the  code  segment 

< define  constant  s,m,t:  integer; 

type  Block,Blockpointer,Nodelabel, Permutation, Boolean  ,Permarray[  ],Mask[  ]; 
constant  Js:  Mask[s]; 
function  Row, Col; 
procedure  Wave,Null,Ccomp  > 
procedure  main(  ); 
var  P:  Nodelabel; 

AL,BL,CL:  Blockpointer; 
ij:  integer; 

7t:  Permarray[t]; 
y:  Permarray[s]; 

%:  Mask[t]; 
begin 

< define  P,7t,y,x  > 
i = Row( ); 
j = Col( ); 
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recv(0,AL); 


recv(13L); 

*CL  = 0; 

Wave(AL,BL,CL,P,Null,3,l  ,t,7c,x); 

Wave(AL,BL,CL,P,Ccomp,3+2t,l,s,y,Js); 
send(2,CL,0,P(ij)) 
end  main; 

In  the  above.  Nodelabel,  Permutation,  Boolean,  Perm  array  [ ] and  Mask[  ] are  as  in  Section  2;  Block,  Block- 
pointer,  Row,  Col,  Null  and  Ccomp  as  in  Section  4;  and  Wave  as  in  Section  5.  These  and  s,m,t3J,  ,rc,\g,x  must 
be  instantiated  to  produce  a concrete  process.  The  characterization  of  s^n.tj,  as  symbolic  constants  external  to 
main(  )and  P,  7t,y,x  as  variables  within  main(  ) is  unimportant  and  merely  reflects  anticipated  programming  en- 
vironments such  as  Unix.  The  two  waves  above  represent  what  Dekel,  Nassimi  and  Sahni  [1]  refer  to  as  the 
alignment  and  shift-multiply  phases  of  a matrix  multiplication  algorithm.  In  the  following,  (P0,E0)  will  be  as  in 
Section  4. 

DEFINITION  6.1.  A Latin  square  on  S is  an  s x s array  whose  rows  and  columns  are  Permutations. 

□ 

A Latin  square  on  S differs  from  a discordant  Permarray[s]  only  in  that  the  rows  and  columns  of  the  latter 
have  a particular  labeling. 

DEFINITION  6.2.  Suppose  7t  is  a Permarray[t],  y is  a Permarray[s],  and  x is  a Mask[t].  Suppose  further 
that  Ly  is  a Latin  square,  W2JC  is  symmetric,  and 

v,n  w w y,* 

wi  °Lk  = LkoWi  (i  e S;  1 < k < s) 

Then  write  (7t,\y,x)  G KER(s). 

□ 

The  above  provides  the  mathematical  characterization  we  alluded  to  at  the  start  of  the  section.  In  particular, 
it  will  permit  us  to  isolate  the  salient  features  of  the  Dekel/Nassimi/Sahni  algorithm  for  hypercubes.  The 
remainder  of  the  section  is  devoted  to  showing  that  the  above  triples  are  a suitable  class  for  instantiating 
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Mulproc,  thus  providing  a unified  description  of  matrix  multiplication  algorithms  on  machines  with  embedded 
toroidal  meshes. 

DEFINITION  6.3.  Consider  the  generic  procedure  definition 

procedure  f(al,bl:  Blockpointer; 

cl:  out  Blockpointer; 
p:  Nodelabel); 
var  n\  Permarray[t]; 

\y:  Permarray[s]; 

X'-  Mask[t]; 
begin 

< define  7t,y,x  > 

*cl  = 0; 

Wave(al,bl,cl,p,Null,3,l,t,7t,y); 

Wave(al,bl,cl,p,Ccomp,3+2t,l,s,\|/Jt ) 
end  f; 

If  f is  as  above  write  f = F(ti,\|/,x). 

□ 

LEMMA  6.1.  Suppose  n is  a Permarraytt],  v is  a Permarray[s],  and  x is  a Mask[t].  If  f = F(7t,y,x ) then 

i.  f e PREMUL(s.m). 

ii.  Typerange(f)  = [3,3+2t+2s). 

iii.  If  n and  \\i  are  realizable  in  Nodearray  (P.E)  and  a process  on  P(i  j)  makes  a call  of  the  form 

f(AL,BL,CLJ>); 

then  the  process  incurs  at  most  2s+2n  unit  block  transfers  in  executing  this  call,  where  n is  the  max- 
imum column  sum  of  X- 
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Proof:  in  Definition  6.3,  the  first  Wave  has  types  for  sends  and  receives  starting  at  3;  the  second  Wave  has 
largest  type  3+2t+2s-l.  This  proves  (i)  and  (ii);  (iii)  is  direct  from  Lemma  5.4. 

□ 

DEFINITION  6.4.  Suppose  n is  a Permarraytt],  y is  a Permarray[s],  and  x is  a Mask[t].  Suppose  further 
that  Out(F(7t,y,x);p,i,j)  = C(ij)  for  each  Nodelabel  p and  all  ij  e S.  Then  write  (:t,y,x)  e MUL(s). 

□ 

THEOREM  6.1.  Suppose  7t  is  a Permarraytt],  y is  a Permarrayts],  x is  a Mask[t],  and  ( rc.y.x ) e KER(s). 
Then  (rc.y.x)  e MUL(s). 

Proof:  let  f = F(n,y,x).  By  Lemma  6.1,  f e PREMUL(s,m)  and  Typerangeff)  = [3,3+2t+2s).  Suppose  P is  a 
Nodelabel  and  for  each  ij  e S,  P(ij)  contains  a process  proc(i j)  with  identifier  1 which  includes 

var  AL3L:  Blockpointer; 

f(AL3L,CL,P); 

Suppose  *AL  = A(i  j)  and  *BL  = B(ij)  on  P(ij)  when  f is  called.  Suppose  any  send  with  identifier  1 issued 

by  any  process,  with  the  exception  of  sends  issued  within  the  single  call  to  f in  a proc(ij),  has  type  < 3 or  > 

3+2t+2s.  Suppose  any  receive  issued  by  a proc(ij),  with  the  exception  of  receives  issued  within  its  call  to  f,  has 
type  < 3 or  > 3+2t+2s.  Now  in  the  call  to  f on  P(i  j),  referring  to  the  format  of  Definition  6.3,  we  have 

Claim  1:  on  exit  from  the  first  Wave  in  f on  P(i j),  *al  = A(i,  W^  n [ij])  and  *bl  = B(  W^  n [j,i]  j). 

Claim  2:  on  exit  from  the  second  Wave  in  f on  P(i  j),  *cl  = C(i  j). 

We  note  that  we  have  satisfied  the  hypothesis  of  Definition  4.5,  and  hence  Out(f;P,i  j)  is  the  value  of  *CL 
on  exit  from  f on  P(i  j),  or  equivalently  the  value  of  *cl  on  exit  ffom  the  second  Wave.  Hence  the  Theorem  fol- 
lows from  Claim  2. 

Proof  of  Claim  1:  on  entry  to  the  first  Wave  on  P(ij)  we  have  *al  = A(ij)  and  *bl  = B(ij).  In  the  notation 
of  Definition  5.1,  for  this  call  h(0;ij)  = A(ij)  and  v(0;ij)=  B(ij).  On  exit  from  the  first  Wave  *al  = h(t;i j)  and 
*bl  = v(t;ij).  Now  by  hypothesis,  no  sends  with  identifier  1 and  3 < type  < 3+2t  are  issued  by  any  process,  ex- 
cept by  a proc(i  j)  in  its  call  to  f.  Sends  and  receives  issued  in  the  second  Wave  in  f have  type  > 3+2t;  hence  no 
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sends  with  identifier  1 and  3 < type  < 3+2t  are  issued  outside  the  first  Wave  in  f.  Also  by  hypothesis,  any  re- 
ceive issued  by  proc(ij)  outside  of  its  call  to  f has  type  < 3 or  > 3+2l  Hence  no  proc(i j)  issues  receives  with  3 
< type  < 3+2t  except  in  the  first  Wave  in  its  f.  Thus  the  conditions  of  Lemma  5.3  are  satisfied  for  the  first  Wave 
in  f on  P(i  j),  and  we  have  on  exit  from  the  first  Wave 

*al  = h(t;iJ)  = h(0;i,WxJij]) 

*bl  = v(t;ij)  = v(0;Wz1t[j4]j) 

Claim  1 follows. 

Proof  of  Claim  2:  on  entry  to  the  second  Wave  in  f on  P(ij),  the  values  of  *al  and  *bl  are  as  in  Claim  1. 
Now  in  the  notation  of  Definition  5.1  applied  to  the  second  Wave  in  f, 

h(0;i  j)  = A(i,WxJij]) 

v(0;ij)  = B(WXKU,i]j) 

By  hypothesis,  no  sends  with  identifier  1 and  3+2t  < type  < 3+2t+2s  are  issued  by  any  process  except  for 
sends  by  a proc(i  j)  in  its  f.  Sends  and  receives  in  the  first  Wave  in  f have  type  < 3+2t.  Hence  no  sends  with 
identifier  1 and  3+2t  < type  < 3+2t+2s  are  issued  outside  the  second  Wave  in  f in  a proc(i  j).  Also  by  hypothesis, 
no  proc(ij)  issues  receives  with  3+2t  < type  < 3+2t+2s  except  in  its  f.  Hence  no  proc(ij)  issues  receives  with 
3+2t  < type  < 3+2t+2s  except  in  the  second  Wave  in  f.  Thus  Lemma  5.1  applies  to  the  second  Wave,  and  the 
executions  of  Ccomp  are  equivalent  to: 

for  k = 0 to  s-1  do  Ccomp(h(0;i,L^+1(j)),v(0;Lj+](i)j),*cl); 

or  equivalently: 

for  k = 0 to  s-1  do  Ccomp(A(i,Wzn[i,L^1(j)]),B(WXff[j.Lk+1(i)]j),*cl); 


Taking  into  account  *cl  = 0 at  the  start  of  f we  find  that  on  exit  from  the  second  Wave  in  f on  P(i  j). 


= t A(i,(  wr  C Lk)(j))  * B((  wj*  o Lj)(i)j) 

*= 1 

= t A(i,(  Lk  o wf*  )(j))  * B((  Lk  o W-  )(i)j) 

*=i 

= t A(i,  L*(  W [ij]))  * B(  L*(  W 04])  j) 

*=i 

= £ A(i,  L*(  W [ij]))  * B(  L*(  W [i j]) j) 

*=i 

Fix  i and  j for  the  moment  and  define  (only  for  this  claim) 

$00  = LkV(  \V  Jij])  = L¥(k,  Wx1t[ij])  (1  < k < s) 

Then 


*cl  = £ A(i£(k»  * B(^(k)  j) 


*=i 


Now  is  a Latin  square  on  S;  hence  its  columns  are  Permutations.  Thus  E,  is  a Permutation.  In  the  last 
sum  above  we  can  write 

k = i;'1  GO  (0  < k'  < s) 

yielding 

*cl  = £ Am  * B(k  J)  = C(ij) 

*= o 

Now  i and  j are  arbitrary;  Claim  2 follows. 

□ 

DEFINITION  6.5.  Suppose  f e PREMUL(s.m).  Let  g = NETLAYER(f).  For  i j e S define  Mul(ij)  by 

< define  type  Block, Blockpointer, Nodelabel, Permutation,  Boolean,  Permarray[  ],Mask[  ]; 
constant  s,t:  integer; 

Js : Mask[s]; 
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function  Row.Col; 


procedure  g,Wave,Null,Ccomp  > 

procedure  main(  ); 

var  P0 : Nodelabel; 
begin 

< define  P0  > 

g(P0) 

end  main; 

This  defines  process  array  Mul.  Suppose  Mul  is  hosted  by  (P0  ,E0)  and  has  identifier  1.  Then  write  Mul  = 
PROGLAYER(f). 

□ 

DEFINITION  6.6.  Suppose  (rc.y.x)  e KER(s)  and  n and  vjr  are  realizable  in  Nodearray  (PJE).  Then  write 
(K,V,X)e  KER(s;P,E). 

□ 

THEOREM  6.2.  Suppose  t is  a positive  integer,  re  is  a Permarray[t],  y is  a Permarray[s],  x is  a Maskft], 
and  ( 7c,y,x  ) e KER(s;P0  JE0 ).  For  each  i,j  e S let  Multiply(ij)  = Mulproc(s,m,t,P0 , n,\|/,x  )•  Then  if  Asource, 
Bsource,  Multiply,  and  Csink  are  the  only  processes  loaded, 

i.  Multiply(ij),  Asource(ij),  Bsource(ij)  and  Csink(ij)  are  all  loaded  to  P0  (i j). 

ii.  Multiply(ij)  receives  A(ij)  and  B(ij)  from  Asource(ij)  and  Bsource(ij)  respectively. 

iii.  Multiply(ij)  sends  C(ij)  to  Csink(ij). 

iv.  Multiply(i  j)  incurs  at  most  2s+2n  unit  block  transfers,  where  n = maximum  column  sum  of  X- 

Proof:  let  f = F(^,y,x ).  Then  by  Lemma  6.1,  f e PREMUL(s.m),  and  Typerange(f)  = [3,b)  where  b > 3 . 
Now  (i)  is  direct  from  Definitions  4.1  - 4.3  and  6.5.  In  Lemma  4.3  take  procarray  = Multiply;  (ii)  is  immediate 
from  Lemma  4.3i.  Also,  since  ( rc,\y,x ) e MUL(s),  Lemma  4.3ii  shows  that  after  Csink(i  j)  executes  its  receive, 
*CL  = Out(f;P0,i  j)  = C(i  j).  This  proves  (iii).  Referring  to  the  formats  of  Definitions  4.6  and  6.5,  the  call  to 
g(P0)  involves  only  one  send  outside  the  call  to  f within  g,  namely 
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send(2,cl,0  J»0  (i  j)); 


Since  the  process  issuing  this  send,  namely  Multiply(ij),  is  on  P0  (ij),  this  send  involves  no  unit  block 
transfers  by  our  definition.  Hence  the  only  unit  block  transfers  come  from  the  call  to  f in  g,  and  (iv)  follows 
from  Lemma  6. liii. 

□ 
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7.  Changing  topologies. 

In  Section  4 we  noted  that  process  arrays  Asource,  Bsource,  Multiply  and  Csink  should  all  be  hosted  by  the 
same  Nodearray  (P0  ,E0  ).  This  minimizes  communication  cost  during  interface  of  processes  since  no  intemode 
communication  is  involved  in  passing  blocks  of  A,B,C.  However,  it  forces  Multiply  to  employ  the  same  topolo- 
gy as  Asource,  Bsource  and  Csink.  A separate  topology,  represented  by  an  alternative  Nodearray  (Pj  ),  may 
be  desirable  for  the  actual  multiplication  A*B.  We  can  solve  this  problem  by  having  each  Multiply(ij)  include  a 
copy  of  Wave  which  acts  as  a filtering  mechanism;  this  is  illustrated  in  Fig.  3. 

The  use  of  Wave'1  in  Fig.  3 is  purely  symbolic;  i.e.  symbolically  we  have 

Multiply  = Wave'1  o F o Wave 

Multiply  acts  as  a "block  server"  for  F , providing  the  latter  with  data  configured  properly  for  its  topology. 
The  F which  appears  above  is  the  kernel  of  the  corresponding  process  array  for  (Pj  ,Ej ). 

DEFINITION  7.1.  Suppose  F e PREMUL(s,m),  Typerange(  F ) = [a,b),  P is  a Nodelabel,  and  y is  a 
Permarray[q].  Then  define  f = R(F;P,y)  by 


procedure  f(al,bl:  Blockpointer; 

cl:  out  Blockpointer; 

P:  Nodelabel); 

var  nullaccum:  Blockpointer; 

b:  integer; 
begin 

< define  b,P,Y  > 

Wave(al,al,  nullaccum  J\Null,b,l,q,Y'1  Jq); 
Wave(bl,bl,nullaccumJ>J^ull,b+2q,l,q,Y'1  Jq); 
F(al,bl,cl,F); 

Wave(cl,cl,nullaccumJ),Null,b44q,l,q,Y,J<j) 


LEMMA  7.1.  Suppose  F e PREMUL(s,m),  Typerange(  F ) = [a,b),  P is  a Nodelabel,  and  y is  a 
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Permarray[q].  Let  f = R(F;P,y).  Then 


i.  f € PREMUL(s.m). 

ii.  Typerange(0  = [a,b+6q). 

iii.  If  y is  realizable  in  Nodearray  (P,E)  and  a process  on  P(ij)  makes  a call  of  the  form 
f(AL,BL,CL,P),  then  the  number  of  unit  block  transfers  incurred  by  the  process  in  executing  this  call  is 
6q  plus  the  number  incurred  in  the  call  F (AL,BL,CL,  P ). 

Proof:  all  sends  by  F have  a < type  < b.  In  f,  referring  to  the  format  of  Definition  7.1,  the  first  Wave  has 
smallest  type  = b for  a send  or  receive;  the  third  Wave  has  largest  type  = b+6q-l  for  a send  or  receive.  Also,  all 
sends  by  F or  Waves  have  identifier  1.  This  proves  (i)  and  (ii).  Now  trivially  y'1  is  realizable  in  (P,E).  Hence 
each  of  the  three  Waves  incurs  2q  unit  block  transfers  by  Lemma  5.4.  This  proves  (iii). 

□ 

THEOREM  7.1.  Suppose  f e PREMUL(sjn),  y is  a Permarray[q],  (P,E)  and  (P,E)  are  Nodearrays  and  (P, 
E)  is  equivalent  to  (P,E)  with  transformation  matrix  (Q(Lq))* . Then  Out(R(F ; P ,y);P,i J)  = Out(F ;P,ij). 

*y 

Proof:  abbreviate  o = ( Lq  )-1  . Let  f = R(  F ; P ,y)  and  Typerange(  F ) = [a,b).  By  Lemma  7.1  we  have  f e 
PREMUL(s,m)  and  Typerange(f)  = [a,b+6q).  Suppose  for  each  ij  e S,  P(ij)  has  some  process  proc(ij)  with 
identifier  1 which  calls  f(AL,BL,CL,P).  Suppose  that  at  the  time  of  this  call  *AL  = A(ij)  and  *BL  = B(ij).  Sup- 
pose any  send  with  identifier  1,  issued  by  any  process,  with  the  exception  of  sends  by  a proc(ij)  within  its  f,  has 
type  < a or  > b+6q.  Suppose  any  receive  issued  by  a proc(i  j),  with  the  exception  of  calls  within  its  f,  has  type  < 
a or  > b+6q. 

Now  for  the  call  to  f on  P(i  J),  referring  to  the  format  of  Definition  7.1,  we  have 

Claim  1:  on  exit  from  the  second  Wave  in  f on  P(ij),  *al  = A(a(i),o(j))  and  *bl  = B(a(i),a(j)). 

Claim  2:  on  exit  from  the  call  to  F in  f on  P(ij),  *cl  = Out(F ; P ,a(i),cr(j)). 

Claim  3:  on  exit  from  the  third  Wave  in  f on  P(i j),  *cl  = Out(F;P,ij). 

We  note  that  we  have  satisfied  the  hypothesis  of  Definition  4.5,  and  hence  Out(f;P,i  j)  is  the  value  of  *CL 
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on  exit  from  f on  P(i  j),  or  equivalently  the  value  of  *cl  on  exit  from  the  third  Wave  in  f.  Hence  the  Theorem 
follows  from  Claim  3. 

Proof  of  Claim  1:  on  entry  to  the  first  Wave  in  f on  P(ij),  *al  = A(ij).  In  Definition  5.1  applied  to  the  first 
Wave  we  have  h(0;ij)  = A(ij).  On  exit  from  this  call  *al  = h(q;i,j).  By  hypothesis,  any  send  outside  f with 
identifier  1 has  type  < b or  > b+2q.  Sends  and  receives  in  the  second  and  third  Waves  in  f have  type  > b+2q, 
and  sends  and  receives  in  F have  type  < b,  since  Typerange(  F ) = [a,b).  Thus  no  send  with  identifier  1,  outside 
the  first  Wave  in  f,  has  b < type  < b+2q.  By  hypothesis  no  receives  issued  in  proc(i  j)  outside  of  f have  b < type 
< b+2q.  The  same  is  true  for  F and  the  second  and  third  Waves  in  f.  Hence  proc(ij)  issues  no  receives  with  b < 
type  < b+2q  outside  the  first  Wave  in  f.  Thus  the  hypothesis  of  Lemma  5.2  is  satisfied  for  the  first  Wave  with  t 
and  n replaced  by  q and  y1 . Hence  on  exit  from  the  first  Wave  in  f on  P(i  j)  we  have 

*al  = h(q;i,j)  = h(0;  LqT  ’ (i),Lq'  (j))  = h(0;o(i),a(j))  = A(a(i),a(j)) 

Similarly,  on  exit  from  the  second  Wave  in  f on  P(ij)  we  have  *bl  = B(a(i),a(j)),  proving  Claim  1. 

Proof  of  Claim  2:  on  entry  to  the  call  to  F on  P(i  j),  we  have  *al  = A(a(i),o(j))  and  *bl  = B(a(i),a(j))  by 
Claim  1.  By  Corollary  2.1i,  P(ij)  = P (a(i),a(j)).  Hence  on  entry  to  F on  P(ij)  = P(a-1  (i).a'1  (j))  we  have  *al  = 
A(i  j)  and  *bl  = B(ij). 

Sends  and  receives  in  the  three  waves  in  f have  type  > b.  By  hypothesis,  sends  with  identifier  1 outside  f 
have  type  < a or  > b.  Hence  any  send  with  identifier  1 outside  F has  type  < a or  > b.  Similarly,  receives  issued 
by  the  proc  on  P (i  j)  (as  distinguished  from  proc(ij))  have  type  < a or  > b,  except  in  F . Hence  Definition  4.5 
applies  at  this  imbedded  level  with  f and  P replaced  by  F and  P.  Hence  the  value  of  *cl  on  P (i,j)  on  exit  from  F 
is  Out(  F ; P ,i  j).  Equivalently,  the  value  of  *cl  on  P (c(i),c(j))  on  exit  from  F is  Out(  F ; P ,a(i),a(j)).  But  by 
Corollary  2.1i,  P(c(i),a(j))  = P(i,j),  proving  Claim  2. 

Proof  of  Claim  3:  on  entry  to  the  third  Wave  in  f on  P(i  j),  by  Claim  2 we  have  *cl  = Out(  F ; P ,a(i),a(j)). 
Invoking  Definition  5.1  once  again,  h(0;ij)  = Out(  F ; P ,c(i),a(j)).  On  exit  from  the  third  Wave  on  P(ij),  *cl  = 
h(q;i  j).  A send  outside  f with  identifier  1 has  type  < a or  > b+6q.  Sends  and  receives  in  the  first  two  Waves 
have  b < type  < b+4q.  Hence  no  sends  with  identifier  1 and  b+4q  < type  < b+6q  are  issued  outside  the  third 
Wave.  Also,  proc(ij)  issues  no  receives  with  b+4q  < type  < b+6q  outside  the  third  Wave.  Thus  Lemma  5.2  ap- 
plies to  the  third  Wave  with  t,  b and  n replaced  by  q,  b+4q  and  y.  Hence  on  exit  from  the  third  Wave  on  P(i  j), 
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*cl  = h(q;ij)  = h(0;Lqr(i),LqY(j))  = Out(f  ;F,(c  ° Lq7)(i),(a  o Lqr)(j))  = Out(f;P,ij) 


This  proves  Claim  3. 

□ 

COROLLARY  7.1.  Suppose  n is  a Permarray[t],  v is  a Permarray[s],  x is  a Mask[t],  (n,\y,x)  e MUL(s),  y 
is  a Permarray[q],  (P0  JE0)  and  (P,  E, ) are  Nodearrays,  and  (P,  ) is  equivalent  to  (P0,E0)  with  transformation 

matrix  (Q(  Lq  ))  Let  f = R(F(  n,\\i,x  );?i  .Y)-  Then  f e PREMUL(s,m),  Typerange(0  = [3,b)  where  b > 3,  and 
Out(f;P0,ij)  = C(ij). 

Proof:  let  F = F(7t,\j/,x).  Then  by  Lemma  6.1,  T e PREMUL(s.m)  and  Typerange(  F)  = [3,b)  where  b > 3. 
Hence  f € PREMUL(s,m)  by  Lemma  7.1.  Since  ( 7t,y,x  ) e MUL(s),  Out(  f J*,  ,ij)  = C(ij).  By  Theorem  7.1, 
Out(f;P0,i,j)  = C(i,j). 

□ 

THEOREM  7.2.  Suppose  n is  a Permarray[t],  y is  a Permarray[s],  x is  a Mask[t],  (P0  £0)  is  as  in  Section 
( rc.VOC ) 6 KER(s;P0  ,E0 ),  (Pj  ) is  a Nodearray,  y is  a Permarray[q]  which  is  realizable  in  (P,  ),  and  (Pj 

E]  ) is  equivalent  to  (P0E0)  with  transformation  matrix  (Q(Lq))*.  Let 

Multiply  = (PROGLAYER  o R)(F(7T,v,x  );P,  ,y). 

Then  if  Asource,  Bsource,  Multiply,  and  Csink  are  the  only  processes  loaded, 

i.  Multiply(i,j),  Asource(ij),  Bsource(ij)  and  Csink(ij)  are  all  loaded  to  P0(ij). 

ii.  Multiply(ij)  receives  A(i  j)  and  B(i,j)  from  Asource(ij)  and  Bsource(ij),  respectively. 

iii.  Multiply(i  j)  sends  C(ij)  to  Csink(i  j). 

iv.  Multiply(ij)  incurs  at  most  2s+2n+6q  unit  block  transfers,  where  n is  the  maximum  column  sum  of 
X- 

Proof:  let  T = F(  rt.y.x  ) and  f = R(  T ;Pj  ,y).  By  Corollary  7.1,  f e PREMUL(s,m).  Now  (i)  is  direct  from 
Definitions  4.1  - 4.3  and  6.5.  In  Lemma  4.3  take  procarray  = Multiply;  (ii)  is  immediate  from  Lemma  4.3i. 
Also  by  Corollary  7.1  we  have  Out(f;P0  ,i,j)  = C(i,j).  Lemma  4.3ii  gives  (iii).  Referring  to  the  format  of 
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Definition  6.5,  the  unit  block  transfers  of  Multiply(ij)  are  incurred  by  the  execution  of  g(P0)  on  P0(i,j),  where  g 
= NETLAYER(f).  As  in  the  proof  of  Theorem  6.2,  the  only  unit  block  transfers  incurred  by  the  call  to  g are  in- 
curred by  its  call  f(al,bl,clj*0  ) in  the  format  of  Definition  4.6.  By  Lemma  7.1  iii,  the  number  of  unit  block 
transfers  incurred  by  this  call  is  at  most  6q  plus  the  number  in  f (al,bl,cl,P,  ).  By  Lemma  6.  liii  the  latter 
number  is  at  most  2s+2n. 

□ 

We  remark  that  if  e is  the  unique  Permarray[0]  with  no  components,  then  (Q(Lq))‘  = I.  In  Theorem  7.2  we 
may  take  P,  = P0  ; these  are  equivalent  with  transformation  matrix  I.  Also  f and  R(  T ;P0  £)  are  indistinguish- 
able. Equating  the  latter  we  can  view  Theorem  6.2  as  a special  case  of  Theorem  7.2. 

Also,  as  noted  in  the  Introduction,  the  machinery  of  [20]  would  be  useful  here.  We  could  then  write  state- 
ments of  the  form  x = p(y),  where  x is  a variable  defined  on  a node  and  y is  a variable  defined  on  another  node 
p.  This  concept  could  be  extended  to  procedure  names  as  well.  This  would  simplify  and  clarify  references  such 
as  "the  proc  on  P(i,j)."  We  have  not  used  this  machinery  for  two  reasons.  The  compiler  generates  communica- 
tion statements;  this  would  obscure  the  analysis  of  communication  cost.  Also,  no  existing  machine  incorporates 
this  machinery. 


39 


8.  Case  study:  maximal  interconnection. 

For  comparison  purposes  we  examine  the  Nodearray  (P0  JE0 ) with  E0  = Js  - 1;  i.e.  each  pair  of  nodes  in  a 
row  or  column  is  adjacent,  the  maximal  connectivity  permitted  if  we  assume  that  a node  is  not  adjacent  to  itself. 
All  Permutations  are  realizable.  Let  Pstan(i  j)  = s*i+j  and  let  P0  = Pstan.  Pstan  is  illustrated  in  Fig.  2 for  s = 4. 

If  x = q*y+r  and  0 < r < y we  define  x%y  = r,  x/y  = q.  Let  X be  the  cyclic  Permutation  defined  by 

X(j)  = (j  e S) 

Let  re  be  the  Permarray[s]  defined  by  rrz  = X1  and  let  X = I-  Then  W z = X'z.  Let  y be  the  Permarray[s] 

defined  by  yz  = X,  z e S.  Then  Lz  = X'z , 0 < z < s.  Hence 

WZR  [zj]  = (j+z)%s  (zj  € S) 

[z j]  = (j+z)%s  (1  < z < s,  j e S) 

THEOREM  8.1.  For  a maximally  connected  s x s Nodearray  we  can  find  a process  array  Multiply  meeting 
the  specifications  of  Section  1,  with  each  Multiply(ij)  incurring  2s+2  unit  block  transfers. 

Proof:  without  loss  of  generality  take  (P0  ,E0 ) and  ( rc.y.x ) as  above.  Then  n is  symmetric  and  Ly  is  a 
Latin  square.  Also 

W * o L^  = L^  o W * = X-  k (i  e S;  1 < k < s) 

Hence  (rr,y,x  ) e KER(s),  and  trivially  (rt,y,x ) € KER(s;P0  ,E0).  By  Theorem  6.2,  if  each  Multiply(ij)  = 
Mulproc(s,m,sJ>0  , rt,y,x  ).  Multiply  satisfies  the  conditions  of  Section  1.  Since  all  column  sums  of  x = I are  1, 
the  result  follows. 

□ 
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9.  Case  study:  toroidal  mesh. 


Assume  s > 2;  s = 1 is  trivial  and  s = 2 is  covered  in  the  previous  section.  Let  0 be  the  Permarray[2]  with 
0O  = X,  0j  = X'1  , where  X is  as  in  Section  8.  Let  Ecyc  = INC(0).  Ecyc  is  illustrated  for  s = 4 in  Fig.  2.  Let  P0  = 
Pstan  as  in  Section  8,  and  let  E0  = Ecyc.  Then 


E0fiJ]  = 


(i  = (j-l)%sorj  = (i-l)%s) 


(otherwise) 


Thus  each  node  is  adjacent  to  its  four  nearest  neighbors,  with  horizontal  and  vertical  connections  between 
the  edges  of  the  array;  i.e.  (P0,E0)  is  a toroidal  mesh.  Let  7t  be  the  Permarray[s-1]  defined  by 


^ X (z  < s/2  - 1) 

^ X'1  (z  > s/2  - 1) 


for  0 < z < s-2.  Let  x be  the  Mask[s-1]  defined  by 


Xtzjc]  = 


c 


((z  < k < s/2)  or  (s/2  < k < 3s/2  - z and  z > s/2)) 
(otherwise) 


for  0 < z < s-2  and  k e S.  Then  for  ke  S, 


x.*  -X[0J0  -xfs-2Jc] 

™ k ~ no  o ...  o 71  s 2 


-XfOJc] 


-Z[s/2-lj£]  -x[s/Zk] 
: s/2-1  °nt/2 


-Xts-2Jc] 

s-2 


Symbolically 


logx  W J'*  = - £ XfcM  + Z XU.k] 

1=0  i=s/2 


s/  2-1 


s-2 


{ 


-k 


s-k 


(k  <=  s/2) 
(k  > s/2) 


-k 


(ke  S) 
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The  last  step  uses  Xs  = X° . It  follows  that 


W = X'k  (ke  S) 


We  note 


I 


i=0 


XM  = 


(k  <=  s/2) 
(k>s/2) 


Thus  the  maximum  column  sum  of  X is  s/2. 


THEOREM  9.1.  For  an  s x s toroidal  mesh  we  can  find  a process  array  Multiply  meeting  the  specifications 
of  Section  1,  with  Multiply(ij)  incurring  at  most  3s  unit  block  transfers. 


Proof:  using  (P0  ,E0 ),  n and  x as  above  and  taking  y as  in  Section  8,  we  note  that  Wz?c  and  Lv  are  as  in 

Section  8.  Thus  the  proof  of  Theorem  8.1  applies,  except  that  Multiply(ij)  = Mulproc(s,m,s-iP0  , n,y,x  ),  and 

the  number  of  unit  block  transfers  incurred  by  a Multiply(ij)  is  at  most  2s  + 2(s/2)  < 3s.  This  holds  for  all  s > 1. 

□ 


REMARK  9.1.  We  have  routed  data  in  such  a way  as  to  minimize  communication  cost.  However,  a 
simpler  algorithm  (e.g.  [15]  p.  126)  is  obtained  by  letting  nz  = X for  0 < z < s-2,  and  xtz.k]  = 1 for  0 < z < k 
and  0 for  k < z < s-2,  k e S.  Then  the  maximum  column  sum  of  x rises  to  s-1,  and  the  number  of  unit  block 
transfers  becomes  4s-2. 

□ 
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10.  Butterflies  and  Gray  codes. 


Here  we  record  some  facts  about  some  classical  families  of  permutations.  Given  Boolean  xd_j  , ...  , x0  let 

(xd  l . - , x0  )2  = i xz  * 2Z 
2=0 

If  x = (xd , , ...  , x0)2  we  note  xz  = (x/2z)%2.  Given  nonnegative  integers  x(1)  , ...  , x(r) , suppose 
x°°  = (x  d\  ...  , x^j  (1  < k < r) 

Then  let 


EX(x<» x»)  = ((  t x“  ) %2 ( £ x“)  %2  )2 

*=1  *=1 

In  particular,  EX(x,y)  is  the  exclusive-or  of  the  bit  strings  representing  x and  y.  We  note  EX(x)  = x = 
EX(x,0),  EX(x,y)  = EX(y,x),  EX(x,EX(y,z))  = EX(x,y,z),  EX(x,x)  = 0,  and  EX(x,y)/2z  = EX(x/2  z,y/2  *).  Also,  if 
EX(x,y)  = EX(x,w)  then  y = w.  Furthermore,  if  x = (xd_j  , ...  , x0)2  and  0 < z < d,  then 

EX(x,2  ) — (xd  j , ...  , x^+j  »1"^2  ,xz  l , ...  , x0)2 

For  a fixed  z,  EX(x,2z)  generates  the  familiar  butterfly  data  flow,  illustrated  in  Fig.  4 for  x < 7 and  z < 2. 
Now  EX  is  not  generally  additive;  however  an  important  exception  is 

LEMMA  10.1.  If  a0  , ...  , ad-1  are  Boolean  then 

EX(a0  * 2°  , ...  , ad.!  * 2d  l ) = EX(  £ az  * 2Z  ) 

2=0 

Proof:  trivial. 

□ 

Define 

GC(i)  = EX(i,i/2).  (i  > 0) 

It  is  easily  verified  that  GC  is  injective.  The  sequence  (GC(0),  GC(1),...}  is  the  so-called  binary  reflected 

Gray  code  (e.g.  [17]).  Usually  it  is  developed  recursively,  but  here  we  have  given  it  a direct  definition.  To  ex- 
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plore  its  property  of  interest,  let 


Z(i)  = EX(GC(i),GC(i-l))  = EX(i,i-l,i/2,(i-l)/2)  (i  > 1) 

Z(0)  = 0 

Then  we  have 

LEMMA  10.2.  For  i > 1,  Z(i)  is  the  highest  power  of  2 dividing  i. 


Proof:  let  i = 

(id-i 

, ...  , ip  )■}  • 

Suppose  ir  = 0 for  r < z. 

and  iz 

i 

= 

(*d-l  • 'd-2  ’ 

...  , iz+1  ,1  ,0,0,... 

.0) 

i-1 

= 

(’d-1  ’ *d-2  » 

...  , iz+1  ,0  ,1,1,... 

,D 

i/2 

= 

* *d-l  • 

» *z+2  ’ i2+l  » 1 » ^ 

,0) 

(i-l)/2 

(0  . id_i  . 

•"  » *z+2  ’ *z+l  » 0 » 1 • " 

• ,1) 

It  follows  that  Z(i)  = 2Z. 

□ 

COROLLARY  10.1.  For  i > 0,  the  binary  representations  of  GC(i)  and  GC(i+l)  differ  in  exactly  one  posi- 
tion. 

Proof:  immediate  from  Lemma  10.2. 

□ 

LEMMA  10.3.  For  i > 0,  EX(Z(0),...,Z(i))  = GC(i). 

Proof:  this  is  true  for  i = 0;  assume  true  for  some  i.  Then 

EX(Z(0)  , ...  , Z(i+1))  = EX(EX(Z(0) Z(i)),Z(i+l)) 

= EX(GC(i),Z(i+l))  = EX(GC(i),EX(GC(i),GC(i+l))) 

= EX(GC(i),GC(i),GC(i+l))  = EX(GC(i+l)) 
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The  result  follows  by  induction. 


□ 


LEMMA  10.4.  For  i > 0,  GC'G)  = EX(i,i/2,i/4,i/8,...). 


Proof:  we  note  GC(i)/2->  = GC(i/2J  ).  Hence 


EX(GC(i),GC(i)/2,GC(i)/4,  ...)  = EX(GC(i),GC(i/2),GC(i/4),  ...) 

= EX(i,i/24/2/i/4,i/4j/8,  ...)  = EX(i)  = i 

The  result  follows. 

□ 
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11.  Case  study:  hypercube. 


In  this  case  we  take  s = 2d.  Let  P0  = Pstan  as  in  Section  8.  Let  it  be  the  discordant  Permarray[d]  defined  by 


rtz(i)  = EX(i,2z) 


(i  e S,  0 < z < d) 


Let  Ehyp  = INC(n)  and  let  E0  = Ehyp.  The  latter  is  illustrated  in  Fig.  2 for  s = 4.  Let  D = {0, ...  ,d- 1 } . By 
Lemma  3.3,  for  i j e S the  nodes  adjacent  to  P0(ij)  are 


{Po(‘^zO))}ze  D u {P0(Mi)j)}z€D 


Now  k and  nz  (k)  differ  only  in  the  z-th  digit  of  their  binary  representations.  Hence  the  same  is  true  for 
P0  (ij)  = s * i + j and  P0  (i,7tz  (j))  = s * i + Ttz  (j)  for  i j e S.  Similarly  s * i + j and  s * 7tz  (i)  + j differ  only  in 
one  position.  Thus  in  (P0  ,E0 ),  nodes  are  adjacent  iff  their  binary  representations  differ  in  exactly  one  position. 
Hence  (P0,E0)  is  a hypercube  of  dimension  2d  [21]. 

We  note  that  if  a is  Boolean,  n*(j)  = EX(j,a  * 2Z),  where  as  usual  o°(j)  = j and  a1  (j)  = a(j)  for  a permu- 
tation a.  More  generally  we  have: 

LEMMA  11.1.  If  a0  , ...  , ar  are  Boolean  and  r e D, 


i=0 


Proof:  the  above  is  true  for  r = 0;  assume  true  for  r-1,  0 < r < d.  Then 


( *0  <>  - ® 7t*r  )(j)  = ( tto1  0 - ° 71  r-l  X V 0)) 


The  last  step  follows  from  Lemma  10.1.  The  result  follows  by  induction. 


□ 


We  note  also  n'z  = nz . Now  let  x be  the  Mask[d]  defined  by 


Xtzjc]  = (k/2z  )%2 


(z  e D;  k e S) 


LEMMA  11.2.  For  kj  6 S,  wj*(j)  = EX(j Jc). 
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Proof:  we  have 


W,  G)  = (*o 


«[0Jt] 


-X[d-ljc] 
' d-1 


° K * , )(])  = ( « o 


Xiojt] 


o n 


_X[d-U] 

d-1 


)G) 


By  Lemma  11.1, 


wr  = EX(j,  X XfeM  * 2Z ) 

i=0 

The  result  follows. 


Let  GC  and  Z be  as  in  Section  10  and  let  y be  the  Permarray[s]  defined  by 
Vq  (j)  = EXOZ(q))  (qj  e S) 

We  note  that  y * = yq . 

LEMMA  1 1.3.  For  1 < i < s we  have 
LV=W*,,C 

” GC(i-l) 

Proof:  we  claim 

(Vr  o o V,,  )G)  = EX02(r), ...  Z(i-l))  (0  < r < i) 

This  is  true  for  r = i-1;  suppose  true  for  some  r > 0.  Then 

(Vr.,  ° - o Vi.!  )G)  = EX(EX(jZ(r),  ...  ,Z(i-l)),Z(r-l))  = EXQZfr-l),  ...  ,Z(i-l)) 
Hence  the  claim  follows  by  induction.  Now  taking  r = 0, 

(Vo  ° - ° Vi-!  )G)  = (Vo  o ...  ° Vi-1  )G) 

= L^G)  = EX(jZ(0),  ...  Z(i-D)  = EXG,GC(i-l)) 

The  last  step  uses  Lemma  10.3.  The  result  follows  from  Lemma  11.2. 


□ 


□ 
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LEMMA  11.4.  For  i,k  € S, 


W 


k 


= w 


x.* 

EX(iJ0 


Proof:  by  Lemma  11.2  we  have 

( Wf*  o wj*  )(j)  = EX(  Wr  0)4)  = EX(EX(kj),i) 
= EX(ij,k)  = EX(j£X(i.k))  = W^(uk)0) 


□ 


LEMMA  11.5.  For  i € S and  1 < k < s we  have 


Proof:  immediate  from  Lemmas  11.3  and  11.4. 


□ 


LEMMA  11.6.  For  Tt.y.x  as  above,  (rt.y.X)  e MUL(s). 

Proof:  by  Lemma  11.2  we  have  WX7t  (zjc)  = EX(z,k).  Hence  WXJt  is  symmetric.  Also,  by  Lemma  11.3, 

Ly  [zjc]  = EX(GC(z-l)Jc).  Now  suppose  Ly  [z,k]  = Ly  [z\k]  . Then  EX(GC(z-l)Jc)  = EX(GC(z'-l)Jc)  and 
hence  GC(z-l)  = GC(z'-l).  Since  GC  is  injective,  z-1  = z'-l  and  z = z . Thus  Lv  is  a Latin  square.  The  result 
follows  from  Lemma  11.5. 

□ 

THEOREM  11.1.  For  a hypercube  with  s2  nodes  we  can  find  a Nodearray  and  a process  array  Multiply 
meeting  the  specifications  of  Section  1 with  Multiply(ij)  incurring  at  most  2(s  + log2  s)  unit  block  transfers. 

Proof:  let  (P0  ,E0 ) = (Pstan£hyp).  Let  n.y.x  be  as  above  and  let  Multiply(ij)  = Mulproc(s,m,d ,P0  , rt.V.X  )• 
Trivially  tc  is  realizable  in  (P0  ).  Also,  \\ii  = 7tz  where  z = Sog2  Z(i).  Hence  \\i  is  realizable  in  (P0  ).  By 

Lemma  11.6  and  Theorem  6.2,  Multiply  satisfies  the  conditions  of  Section  1.  Also,  the  maximum  column  sum  of 
X is  d = log2  s.  The  result  follows. 

□ 

As  noted  in  the  Introduction,  this  adapts  the  Dekel/Nassimi/Sahni  algorithm  [1]  to  a wavefront  setting. 
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Furthermore,  Theorem  6.1  characterizes  the  properties  of  this  solution  which  make  it  viable;  this  is  somewhat 
hidden  in  [1],  as  indicated  by  the  proof  of  Theorem  6.1.  Extension  to  the  non-square  matrix  case  may  be  found 
in  [6]. 

: - ..'v  r ;■  ■ • • ■ ■>  - -■  \ • 
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12.  Case  study:  mesh  embedded  in  hypercube. 

Here  we  assume  s = 2d  , d > 1,  and  that  nodes  in  (Pstan,Ehyp),  as  in  Section  11,  are  adjacent  iff  their 
binary  representations  differ  in  exactly  one  postion.  In  other  words,  we  again  assume  the  underlying  physical 
network  of  nodes  is  a hypercube.  However,  this  time  we  assume  that  Asource,  Bsource  and  Csink  prefer  a 
different  topology  than  in  Section  11.  For  example  they  may  prefer  a subset  of  a barrel  shifter  (or  PM2I)  topolo- 
gy. Assume  it  is  required  that  (P0,E0)  has  an  embedded  toroidal  mesh  as  a subnetwork;  i.e.  with  Ecyc  as  in  Sec- 
tion 9 and  with  the  notation  of  Definition  2.3,  according  to  Lemma  2.3iii  we  want  (P0  .Ecyc)  c (P0  £ 0 ). 
Equivalently,  P0(ij)  is  to  be  adjacent  to  P0(i,(j+1)%2)  and  P0((i+l)%2j).  Let  (PltE,)  = (Pstan,Ehyp)  and  let  n 
and  x be  as  in  Section  11.  Now  to  employ  the  machinery  of  Section  7,  we  want  to  find  Permutation  a so  that 
(Pj  ,Ej  ) is  equivalent  to  (P0  £0)  with  transformation  matrix  Q(a).  The  condition  (P0  ,Ecyc)  c (P0  ,E0)  is 
equivalent  to 

E0(i,(i+l)%s)  =1  (i  £ S) 

By  Corollary  2.1ii  this  is  equivalent  to 

E,(a(i),o((i+l)%s))=  1 (i  e S) 

Since  7t  is  a basis  for  Ej , the  above  is  equivalent  to  c((i+l)%s)  = Jt2(cr(i))  for  some  z = z(i).  For  example, 
if  GC  and  Z are  as  in  Section  10  and 

{log2  Z(i+1)  (0  < i < s-1) 

d-1  (i  = s-1) 

then  we  find  a(s-l)  = 2d_1  and 

o(i+l)  = EX(o(i),GC(i),GC(i+l))  (0  < i < s-1) 

This  suggests  that  we  choose  a = GC.  This  is  the  standard  choice;  e.g.  it  is  also  used  in  [3].  Let  T = 
Q(GC),  Pgc  = T o Pstan  o T*  and  Ecyc  = T o Ehyp  o T*  ; then  we  may  take  (P0,E0)  = (Pgc,Ecyc).  The  case  s 
= 4 is  illustrated  in  Fig.  2.  Also,  GC'1  is  computed  in  Lemma  10.4.  To  exploit  the  machinery  of  section  7,  let  y 
be  the  Permarray[d-1]  defined  by 
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Yz(k)  = 7rf+1Jc]  (k) 


(0  < z < d-1,  k e S) 


We  note 

yz(k)  = EX(k,x[z+U]  * 2Z) 

Thus  Yz(k)  and  k have  binary  representations  which  do  not  differ  except  possibly  in  the  coefficients  of  2Z. 
Hence 

Xtr.YzOO]  = XWtf  (0  < r < d-1,  r * z) 

LEMMA  12.1.  For  0 < r < d-1  and  k e S we  have 

( Yd.2  ° - o Yr  )(k)  = EX(k,  “£  xU+Uc]  * 2Z ) 

Z-r 

Proof:  We  claim 

( Yd_2  o ».  « Yr  )(k)  = EX(k,  Xtr+l,k]  * 2r xW-UO  * 2d'2  ) 

This  is  true  by  definition  for  r = d-2;  assume  true  for  some  r,  0 < r < d-1.  Then 

( Yd.2  o - ° Y,,  )Ck)  = EX(  YM(k),  xfr+1,  Yr.,(k)]  * 2r , ...  , X[d-1,  Yr.,  (k)l  * 2d'2  ) 

= EX(k,  x[rW  * 2r_1 Xtd-l,k]  * 2d  2 ) 

The  claim  follows  by  induction.  The  Lemma  follows  from  Lemma  10.1. 

□ 

LEMMA  12.2.  We  have  GC  = ( lJ. , )'1 . 

Proof:  by  Lemma  12.1  we  have 

( Lj.j  )->  (k)  = EX(k,  £ x[z+l,k]  * 2Z ) (k  € S) 

1=0 

Now  for  k e S,  we  have  x[z+l,k]  = xt2^]  and  k/2  < 2d_1 . Thus 

d-2  d-2 

£ xtz+l*]  * 2Z  = £ xtzJ^/2]  * 2Z  = k/2 
2=0  2=0 
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The  result  follows. 


□ 

THEOREM  12.1.  For  a hypercube  with  s2  nodes  configured  with  an  embedded  toroidal  mesh  as  a subnet- 
work, we  can  find  a process  array  Multiply  meeting  the  specifications  of  Section  1 with  Multiply(ij)  incurring  at 
most  2(s  - 3 + 4 * log2  s)  unit  block  transfers. 

Proof:  we  have  (P0,E0)  = (Pgc,Egc)  as  above.  Taking  (Pj  .Ej ) = (Pstan,Ehyp),  y as  above  and  7t,\jr,x  as  in 
Section  11,  by  Lemma  11.6  we  have  (rc.y.X)  € MUL(s).  Now  by  Lemma  12.2,  since  (Pj^j)  is  equivalent  to  (P 
0,E0)  with  transformation  matrix  Q(GC)  = (QCGC'1  ))*  , Theorem  7.2  applies  with  q = d-1.  The  column  sums  of 
X are  at  most  d = log2  s.  The  result  follows. 

□ 

We  remark  that  Theorem  12.1  exhibits  a communication  cost  6 * log 2 s/2  unit  block  transfers  higher  than 
in  Theorem  11.1.  This  is  due  to  the  fact  that  the  blocks  of  A and  B are  distributed  unfavorably  when  received 
from  Asource  and  B source,  and  must  be  "descrambled"  by  filtering.  In  both  theorems  the  communication  cost  is 
2s  + 0(log  s),  and  hence  the  cost  of  filtering  is  asymptotically  negligible.  This  is  why  we  did  not  bother  to  con- 
cern ourselves  directly  with  the  topology  of  (P0,E0).  If  the  latter  is  of  interest,  we  can  use  Lemma  3.3ii  to  find  a 
basis  for  E0 . 
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13.  Comparisons. 

For  comparing  results  we  will  employ  the  unit  block  transfer  for  measuring  communication  complexity;  as 
defined  earlier  this  is  the  transfer  of  a block  consisting  of  m 2 matrix  elements  between  neighboring  nodes. 
Results  are  recorded  in  Table  1.  In  the  topology  column,  mesh  refers  to  an  ordinary  toroidal  mesh  as  defined  in 
Section  9.  Mesh  in  cube  refers  to  a toroidal  mesh  imbedded  in  a hypercube,  as  in  Section  12  or  [3];  in  [6]  this  is 
referred  to  as  a Gray  code  encoding.  Cube  refers  to  the  arrangement  of  a hypercube  as  in  Section  11  or  [1];  in 
[6]  this  is  referred  to  as  a binary  encoding.  Maximal  refers  to  the  maximal-connectivity  configuration  of  Section 
8. 

In  the  algorithm  column  of  Table  1,  FOH  refers  to  Fox/Otto/Hey  [3].  The  basic  algorithm  there  involves  s 
iterations.  Each  iteration  involves  "rolling"  blocks  of  B vertically  in  parallel  at  a cost  of  one  unit  block  transfer, 
and  then  broadcasting  blocks  of  A from  one  position  of  each  row  to  all  other  row  positions.  Each  broadcast  is  a 
de  facto  synchronization  point  FOH-chain  (listed  here  for  later  purposes,  but  omitted  in  [3])  is  the  simplest  im- 
plementation of  broadcast,  i.e.  passing  the  block  serially  through  the  s row  positions;  each  node  incurs  a delay 
equivalent  to  s unit  block  transfers  per  iteration.  This  does  not  use  cube  connectivity,  and  hence  is  equally  appli- 
cable to  a mesh  or  mesh  in  cube.  FOH-tree  performs  the  broadcast  by  embedding  a tree  in  each  row  of  a mesh 
in  cube,  requiring  at  most  log2  s unit  block  transfers  per  node  per  iteration.  FOH-ring  refers  to  a modification  of 
chaining  by  pipelining  packets  of  a block  through  a row  of  a mesh  in  cube  (the  split-ring  broadcast  in  [3]).  The 
communication  complexity  involves  n,  the  number  of  packets  into  which  a block  decomposes.  Again  this  is  ap- 
plicable to  a mesh  or  mesh  in  cube.  FOH-hard  refers  to  broadcast  implemented  by  hardware  support;  arbitrarily 
we  assume  that  this  involves  the  same  cost  as  one  unit  block  transfer. 

MCBVDV  refers  to  the  algorithm  in  [12],  which  does  not  employ  block  decomposition.  If  the  communica- 
tion cost  is  converted,  it  is  equivalent  to  about  s2  unit  block  transfers  (cf.  [16]). 

It  may  be  noted  that  most  results  in  Table  1,  and  those  developed  herein  in  particular,  are  independent  of 
hardware.  As  we  noted  in  the  Introduction,  augmenting  the  hardware  of  a cube  with  a facility  for  broadcasting  to 
subcubes  is  strictly  a luxury  in  this  context;  even  if  the  cost  of  FOH-hard  is  less  than  2s  as  listed,  it  must  exceed 
s (the  cost  of  rolling  the  B blocks),  and  hence  cannot  be  much  of  an  improvement  over  the  2s  + log2  s complex- 
ity of  the  wave  adaptation  of  Dekel/Nassimi/Sahni  for  cubes.  Asymptotically  the  same  is  true  for  the  cost  of 
wave  for  mesh  in  cube.  We  also  note  that  increasing  the  connectivity  to  the  maximum  gives  only  a negligible 
improvement  over  the  hypercube.  In  tum,  the  hypercube  does  not  improve  much  over  the  mesh  in  this  context 
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(2s  + log  2 s,  versus  3s). 

It  should  also  be  noted  that  the  communication  complexity  of  the  algorithm  in  [3]  is  given  as  essentially  2s. 
This  is  based  on  FOH-ring  and  the  assumption  that  n » s2  . With  the  latter  inequality,  pipelining  of  the  n 
packets  of  a block  through  the  s row  positions  is  very  efficient  in  implementing  broadcast.  However,  the  as- 
sumption n » s 2 is  specific  to  the  Mark  II  and  its  8-byte  packets;  in  attempting  to  port  FOH-ring  to  the  iPSC 
and  its  1024-byte  packets,  the  inequality  n » s2  may  become,  e.g.,  1 » 64.  In  general  the  communication  com- 
plexity of  FOH-ring  could  range  from  O(s)  to  0(s2  ) depending  on  m and  architecture.  In  the  Introduction  we 
substituted  0(s  * log  s)  for  an  upper  bound  on  this  work,  based  on  FOH-tree  which  is  independent  of  architec- 
ture or  m. 

The  impact  of  communication  complexity  depends  on  other  costs  such  as  computation,  overhead  and  syn- 
chronization, as  well  as  the  translation  of  unit  block  transfers  to  actual  time  cost.  Computation  time  cost  is 
roughly  of  the  form  a * M3  /s2  for  some  constant  a.  Time  cost  of  a unit  block  transfer  might  be  modeled  as  L + 
b * n,  where  L is  a constant  representing  latency,  b is  a constant  and  n is  the  number  of  packets  in  a block.  For 
simplicity  we  will  measure  packet  size  in  terms  of  the  number  p of  matrix  elements  in  a packet;  then  n = [m2 
/pi,  where  fx"|  denotes  the  least  integer  > x.  For  example,  if  elements  occupy  8 bytes,  the  Mark  II  has  p = 1 and 
the  iPSC  has  p = 128.  With  these  assumptions,  execution  time  may  be  modeled  as 


time 


a*s*m 3 + 


U(s)*L  + U (s)*b* 


+ V(mj) 


where  U(s)  is  number  of  unit  block  transfers  and  V(m,s)  includes  overhead  and  synchronization  costs.  The  latter 
includes  delays  when  a node  cannot  compute  because  it  is  waiting  for  receipt  of  a block.  Presumably  V(m,s) 
can  be  controlled  by  buffering,  use  of  optimal  code,  etc.;  nonetheless  it  seems  difficult  to  estimate  V with  any 
precision.  We  will  assume  V(m,s)  = o(s*m2 );  this  is  probably  too  high  for  large  m,  but  suffices  for  analysis. 

In  the  expression  for  time  we  note  that  any  term  could  dominate.  For  example,  V(m,s)  may  be  predom- 
inant for  small  m;  a*s*m3  dominates  if  s is  small  and  m is  large;  U(s)*L  may  dominate  if  m is  small,  s is  large 
and  L » b;  and  U(s)*b*[m2  /p]  may  dominate  if,  e.g.,  U(s)  = s2  , m = s1/2  and  s is  large.  Now  if  we  take  U(l) 
= V(m,l)  = 0,  we  have  also 


efficiency 


B (s)*L+U  (s)*b* 


1 + 


m 

P 


+V  (m  ,s) 


a*s*m 3 
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In  particular  we  have 


efficiency  = — - — ( U (s)  = 0 (s)  ) 

1 + 0(— ) 
m 

We  also  note 

efficiency  = — ( U(s)  = 0(s2)  ) 

1 + 0(— ) 
m 

In  general,  if  we  define  granularity  as  ratio  of  computation  to  communication  cost, 

ef f iciency  = — 

l + °( 7—) 

granularity 

We  note  the  central  role  played  by  m in  these  analyses,  as  opposed  to  M.  This  shows  that  problem  size,  per 
se,  is  a secondary  parameter  in  designing  and  choosing  algorithms  for  applications  such  as  matrix  multiplication. 
Granularity,  e.g.  defined  by  ratio  of  computation  to  communication  cost,  is  much  more  relevant  for  message- 
passing machines.  We  observe  that  granularity  here  depends  on  m if  U(s)  = O(s),  and  on  m/s  if  U(s)  = 0(s2  ). 
Although  some  of  the  algorithms  in  Table  1 fall  in  between  these  extremes,  it  is  instructive  to  note  the  conse- 
quences of  this  dichotomy.  In  particular,  an  O(s)  communication  complexity,  as  represented  by  U(s),  has  the 
property  of  guaranteeing  that  at  least  minimal  efficiency  is  attained  for  any  architecture  and  any  values  of  M and 
s (of  course  if  s > M only  a subset  of  nodes  is  used;  hence  we  may  restrict  s < M in  practice).  To  make  this 
more  concrete,  we  introduce: 

DEFINITION  13.1.  We  say  that  an  algorithm  is  uniformly  asymptotically  stable  with  respect  to  a class  of 
architectures  E if  for  each  instance  of  E there  exists  a constant  e0  such  that  efficiency  > e0  for  any  problem  size 
and  number  of  nodes. 

□ 

This  definition  is  more  of  a working  nature  than  theoretical:  in  practice  the  number  of  nodes  permitted  and 
problem  size  will  be  restricted  by  the  communication  and  memory  subsystems,  respectively,  for  a given  architec- 
ture. Nonetheless,  we  conclude  that  the  algorithms  in  Table  1 with  U(s)  = O(s),  and  hence  granularity  m,  are  un- 
iformly asymptotically  stable  with  respect  to  most  existing  message-passing  machines.  In  fact,  as  m ->  °°, 
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efficiency  — > 100%.  On  the  other  hand,  if  U(s)  = 0(s2 ),  and  hence  granularity  = m/s,  we  may  have  efficency  — > 
0 for  large  s.  For  example,  the  Connection  Machine  [23]  has  s = 256,  and  if  a matrix  element  occupies  8 bytes 
we  have  m < 8.  Hence  m/s  < 1/32,  forcing  small  granularity  if  U(s)  = 0(s2  ).  This  is  due  a priori  to  memory 
limitations  on  the  Connection  Machine,  but  more  generally  we  note  that  execution  time  is  at  least  a*M3  /s2  = 
a*s4  (m/s)3  . Thus  on  any  machine  m/s  cannot  be  large  for  large  s.  If  processor  power  is  considered  the  situa- 
tion exacerbates.  For  example,  suppose  processors  are  implemented  on  single  chips  whose  monetary  cost  rises 
linearly  with  area,  and  which  obey  an  area-time2  tradeoff  on  computation.  The  total  cost  of  the  s2  processors  is 
then  0(s2  /a2  ),  so  that  a will  in  general  grow  as  c*s  rather  than  remain  constant  for  large  s.  In  this  case  it  is 
more  accurate  to  model  execution  time  as  at  least  c*M3  /s  = c*s5  (m/s)3  . Thus  if  granularity  is  measured  by 
m/s,  small  granularity  is  virtually  forced  in  this  context  for  massively  parallel  machines.  This  suggests  that  port- 
ing algorithms  such  as  MCBVDV,  FOH-ring  and  FOH-chain  to  massively  parallel  machines  is  unwise. 

In  contrast,  if  s < 4 the  choice  of  algorithm  is  probably  unimportant.  Table  2 gives  the  results  of  some  runs 
on  a 16-node  iPSC,  with  execution  time  in  seconds.  The  algorithms  tested  are  seen  to  be  indistinguishable.  The 
reason  is  the  last  column,  which  gives  the  computation  time  for  the  matrix  multiplication  phase,  with  communi- 
cation excised.  This  shows  that  high  efficiency  is  attained  even  for  relatively  small  M,  e.g.  M > 64.  This  is 
predicted  by  our  model:  even  if  granularity  is  measured  by  m/4,  for  M > 64  we  have  m/4  > 4,  so  that  high 
granularity  is  attained  quickly. 

We  note  also  that  FOH-chain  is  easier  to  implement  than  the  other  three  algorithms;  hence  it  seems  to  be  a 
good  choice  for  this  particular  machine  despite  its  0(s2  ) communication  cost.  On  the  other  hand,  we  note  that 
for  s = 4,  m = 64,  communication  cost  may  be  as  much  as  one  second.  Our  theory  predicts  that  this  cost  would 
rise  to  about  64  seconds  on  a similar  (but  nonexistent)  1024-node  machine,  compared  to  a 50-second  computa- 
tion time.  The  wave  version  should  reduce  communication  cost  to  about  5 seconds  in  this  case. 
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14.  Conclusions. 


Using  matrix  multiplication  as  an  example,  we  have  explored  various  aspects  of  computing  on  distributed- 
memory,  message-passing  machines.  The  focus  has  been  environmental  rather  than  algorithmic,  except  for  Sec- 
tion 6.  We  have  noted  the  importance  of  the  interface  between  processes,  and  its  connection  with  distributed 
data  structures.  We  have  seen  that  wavefront  computing  is  an  effective  means  of  implementing  algorithms;  futh- 
ermore,  we  have  extended  the  notion  of  wavefront  computing  to  include  dataflow  between  nodes  of  networks  of 
process  arrays.  This  permits  iterative  algorithms  to  be  constructed  from  wavefront-based  modules  which  can  be 
interfaced  so  that  data  flows  continuously  rather  than  having  the  end  of  an  iteration  serve  as  a de  facto  synchron- 
ization point.  Furthermore,  we  have  seen  that  interfaces  between  process  arrays  can  be  constructed  so  as  to  per- 
mit each  process  array  to  execute  on  its  own  virtual  node  array.  Thus,  iterative  algorithms  can  be  constructed 
from  routines  which  are  heterogeneous  with  respect  to  topology. 

There  are  various  ways  in  which  the  present  work  needs  to  be  extended.  In  the  future  we  plan  to  give  ex- 
amples of  iterative  algorithms  calling  matrix  multiplication  as  a subroutine,  thereby  giving  concrete  examples  of 
the  usage  of  the  machinery  developed  here.  Also,  here  and  elsewhere  the  role  of  distributed  data  structures  has 
been  observed;  in  fact  it  has  been  noted  that  algorithms  on  distributed-memory  machines  could  be  characterized 
as  mappings  between  such  structures.  This  notion  needs  to  be  developed  further,  since  it  is  a radical  departure 
from  the  classical  theory  of  algorithms  which  has  been  based  on  models  such  as  Turing  machines. 
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